Today, when the `Fixes` prefix is missing or the developer is not a collaborator with `scylladbbot` we remove the backport labels to prevent the process from starting and notifying the developers.
Developers are worried that removing these backport labels will cause us to forget we need to do these backports. @nyh suggested to add a `scylladbbot/backport_error` label instead
Applied those changes, so when a `Fixes` prefix is missing we will add a `scylladbbot/backport_error` label and stop the process
When a user doesn't accept the invite we will still open the PR but he will not be assigned and will not be able to edit the branch when we have conflicts
Fixes: https://github.com/scylladb/scylla-pkg/issues/4898
Fixes: https://github.com/scylladb/scylla-pkg/issues/4897
Before we were using a marketplace Github action which had some limitations.
With this pull request we are updating the github action using curl option which will gives us full control of the flow instead of relying on pre made github action.
Fixes: scylladb#23088
Closesscylladb/scylladb#23215
Claim that building with CMake files is just 'not supported' instead of
not intended, especially that there are attempts to enable this.
Remove the obsolete mention of the `FOR_IDE` flag.
Closesscylladb/scylladb#22890
This commit adds documentation for zero-token nodes and an explanation
of how to use them to set up an arbiter DC to prevent a quorum loss
in multi-DC deployments.
The commit adds two documents:
- The one in Architecture describes zero-token nodes.
- The other in Cluster Management explains how to use them.
We need separate documents because zero-token nodes may be used
for other purposes in the future.
In addition, the documents are cross-linked, and the link is added
to the Create a ScyllaDB Cluster - Multi Data Centers (DC) document.
Refs https://github.com/scylladb/scylladb/pull/19684
Fixes https://github.com/scylladb/scylladb/issues/20294Closesscylladb/scylladb#21348
In commit 4812a57f, the fmt-based formatter for gossip_digest_syn had
formatting code for cluster_id, partitioner, and group0_id
accidentally commented out, preventing these fields from being included
in the output. This commit restores the formatting by uncommenting the
code, ensuring full visibility of all fields in the gossip_digest_syn
message when logging permits.
This fixes a regression introduced in 4812a57f, which obscured these
fields and reduced debugging insight. Backporting is recommended for
improved observability.
Fixes#23142
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23155
During development of #22428 we decided that we have
no need for `object-storage.yaml`, and we'd rather store
the endpoints in `scylla.yaml` and get a REST api to exopose
the endpoints for free.
This patch removes the credentials provider used to read the
aws keys from this yaml file.
Followup work will remove the `object-storage.yaml` file
altogether and move the endpoints to `scylla.yaml`.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#22951
This action will help preventing next-trigger for running every 15 minutes.
This action will run on push for a specific branch (next, next-enterprise, 2024.x, x.x)
Fixes: scylladb#23088
update action
Closesscylladb/scylladb#23141
The scylla-sstable dump-* command suite has proven invaluable in many investigations. In certain cases however, I found that `dump-data` is quite cumbersome. An example would be trying to find certain values in an sstable, or trying to read the content of system tables when a node is down. For these cases, `dump-data` is very cumbersome: one has to trudge through tons of uninteresting metadata and do compaction in their heads. This PR introduces the new scylla-sstable query command, specifically targeted at situations like this: it allows executing queries on sstables, exposing to the user all the power of CQL, to tailor the output as they see fit.
Select everything from a table:
$ scylla sstable query --system-schema /path/to/data/system_schema/keyspaces-*/*-big-Data.db
keyspace_name | durable_writes | replication
-------------------------------+----------------+-------------------------------------------------------------------------------------
system_replicated_keys | true | ({class : org.apache.cassandra.locator.EverywhereStrategy})
system_auth | true | ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 1})
system_schema | true | ({class : org.apache.cassandra.locator.LocalStrategy})
system_distributed | true | ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 3})
system | true | ({class : org.apache.cassandra.locator.LocalStrategy})
ks | true | ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1})
system_traces | true | ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 2})
system_distributed_everywhere | true | ({class : org.apache.cassandra.locator.EverywhereStrategy})
Select everything from a single SSTable, use the JSON output (filtered through [jq](https://jqlang.github.io/jq/) for better readability):
$ scylla sstable query --system-schema --output-format=json /path/to/data/system_schema/keyspaces-*/me-3gm7_127s_3ndxs28xt4llzxwqz6-big-Data.db | jq
[
{
"keyspace_name": "system_schema",
"durable_writes": true,
"replication": {
"class": "org.apache.cassandra.locator.LocalStrategy"
}
},
{
"keyspace_name": "system",
"durable_writes": true,
"replication": {
"class": "org.apache.cassandra.locator.LocalStrategy"
}
}
]
Select a specific field in a specific partition using the command-line:
$ scylla sstable query --system-schema --query "select replication from scylla_sstable.keyspaces where keyspace_name='ks'" ./scylla-workdir/data/system_schema/keyspaces-*/*-Data.db
replication
-------------------------------------------------------------------------------------
({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1})
Select a specific field in a specific partition using ``--query-file``:
$ echo "SELECT replication FROM scylla_sstable.keyspaces WHERE keyspace_name='ks';" > query.cql
$ scylla sstable query --system-schema --query-file=./query.cql ./scylla-workdir/data/system_schema/keyspaces-*/*-Data.db
replication
-------------------------------------------------------------------------------------
({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1})
New functionality: no backport needed.
Closesscylladb/scylladb#22007
* github.com:scylladb/scylladb:
docs/operating-scylla: document scylla-sstable query
test/cqlpy/test_tools.py: add tests for scylla-sstable query
test/cqlpy/test_tools.py: make scylla_sstable() return table name also
scylla-sstable: introduce the query command
tools/utils: get_selected_operation(): use std::string for operation_options
utils/rjson: streaming_writer: add RawValue()
cql3/type_json: add to_json_type()
test/lib/cql_test_env: introduce do_with_cql_env_noreentrant_in_thread()
There are several API endpoints that walk a specific list of sstables and sum up their bytes_on_disk() values. All those endpoints accumulate a map of sstable names to their sizes, then squashe the maps together and, finally, sum up the map values to report it back. Maintaining these intermediate collections is the waste of CPU and memory, the usage values can be summed up instantly.
Also add a test for per-cf endpoints to validate the change, and generalize the helper functions while at it.
Closesscylladb/scylladb#23143
* github.com:scylladb/scylladb:
api: Generalize disk space counting for table and system
api: Use map_reduce_cf_raw() overload with table name
api: Don't collect sstables map to count disk space usage
test: Add unit test for total/live sstable sizes
There are two semaphores in table for synchronizing changes to sstable list:
sstable_set_mutation_sem: used to serialize two concurrent operations updating
the list, to prevent them from racing with each other.
sstable_deletion_sem: A deletion guard, used to serialize deletion and
iteration over the list, to prevent iteration from finding deleted files on
disk.
they're always taken in this order to avoid deadlocks:
sstable_set_mutation_sem -> sstable_deletion_sem.
problem:
A = tablet cleanup
B = take_snapshot()
1) A acquires sstable_set_mutation_sem for updating list
2) A acquires sstable_deletion_sem, then delete sstable before updating list
3) A releases sstable_deletion_sem, then yield
4) B acquires sstable_deletion_sem
5) B iterates through list and bumps sstable deleted in step 2
6) B fails since it cannot find the file on disk
Initial reaction is to say that no procedure must delete sstable before
updating the list, that's true.
But we want a iteration, running concurrently to cleanup, to not find sstables
being removed from the system. Otherwise, e.g. snapshot works with sstables
of a tablet that was just cleaned up. That's achieved by serializing iteration
with list update.
Since sstable_deletion_sem is used within the scope of deletion only, it's
useless for achieving this. Cleanup could acquire the deletion sem when
preparing list updates, and then pass the "permit" to deletion function, but
then sstable_deletion_sem would essentially become sstable_set_mutation_sem,
which was created exactly to protect the list update.
That being said, it makes sense to merge both semaphores. Also things become
easier to reason about, and we don't have to worry about deadlocks anymore.
The deletion goes through sstable_list_builder, which holds a permit throughout
its lifetime, which guarantees that list updates and deletion are atomic to
other concurrent operations. The interface becomes less error prone with that.
It allowed us to find discard_sstables() was doing deletion without any permit,
meaning another race could happen between truncate and snapshot.
So we're fixing race of (truncate|cleanup) with take_snapshot, as far as we
know. It's possible another unknown races are fixed as well.
Fixes#23049.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#23117
To simplify aborting scylla while starting the services,
add a _ready state to stop_signal, so that until
main is ready to be stopped by the abort_source,
just register that the signal is caught, and
let a check() method poll that and request abort
and throw respective exception only then, in controlled
points that are in-between starting of services
after the service started successfully and a deferred
stop action was installed.
This patch prevents gate_closed_exception to escape handling
when start-up is aborted early with the stop signal,
causing https://github.com/scylladb/scylladb/issues/23153
The regression is apparently due to a25c3eaa1c
Fixes https://github.com/scylladb/scylladb/issues/23153
* Requires backport to 2025.1 due to a25c3eaa1cClosesscylladb/scylladb#23103
* github.com:scylladb/scylladb:
main: add checkpoints
main: safely check stop_signal in-between starting services
main: move prometheus start message
main: move per-shard database start message
Since it is requirement for Red Hat OpenShift Certification, we need to
run the container as non-root user.
Related scylladb/scylla-pkg#4858
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Before starting significant services that didn't
have a corresponding call to supervisor::notify
before them.
Fixes#23153
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To simplify aborting scylla while starting the services,
Add a _ready state to stop_signal, so that until
main is ready to be stopped by the abort_source,
just register that the signal is caught, and
let a check() method poll that and request abort
and throw respective exception only then, in controlled
points that are in-between starting of services
after the service started successfully and a deferred
stop action was installed.
Refs #23153
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The `prometheus_server` is started only conditionally
but the notification message is sent and logged
unconditionally.
Move it inside the condtional code block.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that we support suite subfolders, there is no
need to create an own suite for object_store and auth_cluster, topology, topology_custom.
this PR merge all these folders into one: 'cluster"
this pr also introduce and apply 'prepare_3_nodes_cluster' fixture that allow preparing non-dirty 3 nodes cluster
that can be reused between tests(for tests that was in topology folder)
number of tests in master
release -3461
dev -3472
debug -3446
number of tests in this PR
release -3460
dev -3471
debug -3445
There is a minus one test in each mode because It was 2 test_topology_failure_recovery files(topology and topology_custom) with the same utility functions but different test cases. This PR merged them into one
Closesscylladb/scylladb#22917
* github.com:scylladb/scylladb:
test.py: merge object_store into cluster folder
test.py: merge auth_cluster into cluster folter
test.py: rename topology_custom folder to cluster
test.py: merge topology test suite into topology_custom
test.py delete conftest in topology_custom
test.py apply prepare_3_nodes_cluster in topology
test.py: introduce prepare_3_nodes_cluster marker
Now when the bodies of both map-reduce reducers are the same, they can
be generalized with each other.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The existing helper that counds disk space usage for a table map-reduces
the table object "by hand". Its peer that counts the usage for all
tables uses the map_reduce_cf_raw() helper. The latter exists for
specific table as well, so the first counter can benefit from using it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the API calls that collect disk usage of sstables accumulate
map<sstable name, disk size>, then merges shard maps into one, then
counts the "disk size" values and drops the map itself on the floor.
This is waste of CPU cycles, disk usage can be just summed up along
cf/sstables iterations, no need to accumulate map with names for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The pair of column_family/metrics/(total|live)_disk_space_used/{name}
reports the disk usage by sstables. The test creates table, populates,
flushes and checks that the size corresonds to what stat(2) reports for
the respective files.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For the limited voters feature to work properly we need to make sure that we are only managing the voter status through the topology coordinator. This means that we should not change the node votership from the storage_service module for the raft topology directly.
We can drop the voter status changes from the storage_service module because the topology coordinator will handle the votership changes eventually. The calls in the storage_service module were not essential and were only used for optimization (improving the HA under certain conditions).
Furthermore, the other bundled commit improves the reaction again by reacting to the node `on_up()` and `on_down()` events, which again shortens the reaction time and improves the HA.
The change has effect on the timing in the tablets migration test though, as it previously relied on the node being made non-voter from the service_storage `raft_removenode()` function. The fix is to add another server to the topology to make sure we will keep the quorum.
Previously the test worked because the test waits for an injection to be reached and it was ensured that the injection (log line) has only been triggered after the node has been made non-voter from the `raft_removenode()`. This is not the case anymore. An alternative fix would be to wait for the first node to be made non-voter before stopping the second server, but this would make the test more complex (and it is not strictly required to only use 4 servers in the test, it has been only done for optimization purposes).
Fixes: scylladb/scylladb#22860
Refs: scylladb/scylladb#18793
Refs: scylladb/scylladb#21969
No backport: Part of the limited voters new feature, so this shouldn't to be backported.
Closesscylladb/scylladb#22847
* https://github.com/scylladb/scylladb:
raft: use direct return of future for `run_op_with_retry`
raft: adjust the voters interface to allow atomic changes
raft topology: drop removing the node from raft config via storage_service
raft topology: drop changing the raft voters config via storage_service
In commit 2463e524ed, Scylla's default changed
from starting with one tablet per shard to starting 10 per shard. The
functional tests don't need more tablets and it can only slow down the
tests, so the patch added --tablets-initial-scale-factor=1 to test/*/suite.yaml
but forgot to add it to test/cqlpy/run.py (to affect test/cqlpy/run) so
this patch does this now.
This patch should *only* be about making tests faster, although to be
honest, I don't see any measurable improvement in test speed (10 isn't
so many). But, unfortunately, this is only part of the story. Over time
we allowed a few cqlpy tests to be written in a way that relies on having
only a small number of tablets or even exactly one tablet per shard (!).
These tests are buggy and should be fixed - see issues #23115 and #23116
as examples. But adding the option --tablets-initial-scale-factor=1 also
to run.py will make these bugs not affect test/cqlpy/run in the same way
as it doesn't affect test.py.
These buggy tests will still break with `pytest cqlpy` against a Scylla
you ran yourself manually, so eventually will still need to fix those
test bugs.
Refs #23115
Refs #23116Closesscylladb/scylladb#23125
The cqlpy test test_compaction.py::test_compactionstats_after_major_compaction
was written to assume we have just one tablet per shard - if there are many
tablets compaction splitting the data, the test scenario might not need
compaction in the way that the test assumes it does.
Recently (commit 2463e524ed) Scylla's default
was changed to have 10 tablets per shard - not one. This broke this test.
The same commit modified test/cqlpy/suite.yaml, but that affects only test.py
and not test/cqlpy/run, and also not manual runs against a manually-installed
Scylla. If this test absolutely requires a keyspace with 1 and not 10
tablets, then it should create one explicitly. So this is what this test
does (but only if tablets are in use; if vnodes are used that's fine
too).
Before this patch,
test/cqlpy/run test_compaction.py::test_compactionstats_after_major_compaction
fails. After the patch, it passes.
Fixes#23116Closesscylladb/scylladb#23121
The currently used versions of "wasmtime", "idna", "cap-std" and
"cap-primitives" packages had low to moderate security issues.
In this patch we update the dependencies to versions with these
issues fixed.
The update was performed by changing the "wasmtime" (and "wasmtime-wasi")
version in rust/wasmtime_bindings/Cargo.toml and updating rust/Cargo.lock
using the "cargo update" command with the affected package. To fix an
issue with different dependencies having different versions of
sub-dependencies, the package "smallvec" was also updated to "1.13.1".
After the dependency update, the Rust code also needed to be updated
because of the slightly changed API. One Wasm test case needed to be
updated, as it was actually using an incorrect Wat module and not
failing before. The crate also no longer allows multiple tables in
Wasm modules by default - it is now enabled by setting the "gc" crate
feature and configuring the Engine with config.wasm_reference_types(true).
Fixes https://github.com/scylladb/scylladb/issues/23127Closesscylladb/scylladb#23128
It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if permit timed out). If the permit also happens to have wait for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already have an exception and this will trigger an assert. Add a separate case for checking if the permit is aborted already. If so, treat it as immediate eviction: close the reader and clean up.
Fixes: scylladb/scylladb#22919
Bug is present in all live versions, backports are required.
Closesscylladb/scylladb#23044
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
Range scans are expected to go though lots of tombstones, no need to
spam the logs about this. The tombstone warning log is demoted to debug
level, if somebody wants to see it they can bump the logger to debug
level.
Fixes: https://github.com/scylladb/scylladb/issues/23093Closesscylladb/scylladb#23094
These redundant `std::move()` calls were identified by GCC-14.
In general, copy elision applies to these places, so adding
`std::move()` is not only unnecessary but can actually prevent
the compiler from performing copy elision, as it causes the
return statement to fail to satisfy the requirements for
copy elision optimization.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23063
This series is part of the effort to reduce the overall overhead originating from metrics reporting, both on the Scylla side and the metrics collecting server (Prometheus or similar)
The idea in this series is to create an equivalent of levels with a label.
First, label a subset of the metrics used by the dashboards.
Second, the per-table metrics that are now off by default will be marked with a different label.
The following specific optional features: CDC, CAS, and Alternator have a dedicated label now.
This will allow users to disable all metrics of features that are not in use.
All the rest of the metrics are left unlabeled.
Without any changes, users would get the same metrics they are getting today.
But you could pass the `__level=1` and get only those metrics the dashboard needs. That reduces between 50% and 70% (many metrics are hidden if not used, so the overall number of metrics varies).
The labels are not reported based on the seastar feature of hiding labels that start with an underscore.
Closesscylladb/scylladb#12246
* github.com:scylladb/scylladb:
db/view/view.cc: label metrics with basic_level
transport/server.cc: label metrics with basic_level
service/storage_proxy.cc: label metrics with basic_level and cas
main.cc: label metrics with basic_level
streaming/stream_manager.cc: label metrics with basic_level
repair/repair.cc: label metrics with basic_level
service/storage_service.cc: label metrics with basic_level
gms/gossiper.cc: label metrics with basic_level
replica/database.cc: label metrics with basic_level
cdc/log.cc: label metrics with basic_level and cdc
alternator: label metrics with basic_level and alternator
row_cache.cc: label metrics with basic_level
query_processor.cc: label metrics with basic_level
sstables.cc: label metrics with basic_level
utils/logalloc.cc label metrics with basic_level
commitlog.cc: label metrics with basic_level
compaction_manager.cc: label metrics with basic_level
Adding the __level and features labels
Refs #22916
Adds an "enable_session_tickets" option to TLS setup for our server
endpoints (not documented for internode RPC, as we don't handle it
on the client side there), allowing enabling of TLS3 client session
ticket, i.e. quicker reconnect.
Session tickets are valid within a time frame or until a node
restarts, whichever comes first.
v2:
Use "TLS1.3" in help message
Closesscylladb/scylladb#22928
The following metrics will be marked with basic_level label:
scylla_transport_cql_errors_total
scylla_transport_current_connections
scylla_transport_requests_served
scylla_transport_requests_shed
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The following metrics will be marked with basic_level label:
scylla_scylladb_current_version
scylla_reactor_utilization
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The following metrics will be marked with basic_level label:
scylla_gossip_heart_beat
scylla_gossip_live
scylla_gossip_unreachable
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The following metrics will be marked with basic_level label:
scylla_cdc_operations_failed
scylla_cdc_operations_total
All metrics are labeld with the __cdc label.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The following metrics will be marked with basic_level label:
scylla_alternator_operation
scylla_alternator_op_latency_bucket
scylla_alternator_op_latency_count
scylla_alternator_op_latency_sum
scylla_alternator_total_operations
scylla_alternator_batch_item_count
scylla_alternator_op_latency
scylla_alternator_op_latency_summary
scylla_expiration_items_deleted
All alternator metrics are marked with __alternator label.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The following metrics will be marked with basic_level label:
scylla_sstables_cell_tombstone_writes
scylla_sstables_range_tombstone_reads
scylla_sstables_range_tombstone_writes
scylla_sstables_row_tombstone_reads
scylla_sstables_tombstone_writes
The following metrics will be marked with basic_level label:
scylla_lsa_total_space_bytes
scylla_lsa_non_lsa_used_space_bytes
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The following metrics will be marked with basic_level label:
scylla_commitlog_segments
scylla_commitlog_allocating_segments
scylla_commitlog_unused_segments
scylla_commitlog_alloc
scylla_commitlog_flush
scylla_commitlog_bytes_written
scylla_commitlog_pending_allocations
scylla_commitlog_requests_blocked_memory
scylla_commitlog_flush_limit_exceeded
scylla_commitlog_disk_total_bytes
scylla_commitlog_disk_active_bytes
scylla_commitlog_disk_slack_end_bytes
Scylla generates many metrics, and when multiplied by the number of
shards, the total number of metrics adds a significant load to a
monitoring server.
With multi-tier monitoring, it is helpful to have a smaller subset of
metrics users care about and allow them to get only those.
This patch adds two kind of labels, the a __level label, currently with
a single value, but we can add more in the future.
The second kind, is a cross feature label, curently for alternator, cdc
and cas.
We will use the __level label to mark the interesting user-facing metrics.
The current level value is:
basic - metrics for Scylla monitoring
In this phase, basic will mark all metrics used in the dashboards.
In practice, without any configuration change, Prometheus would get the
same metrics as it gets today.
While it is possible to filter by the label, e.g.:
curl http://localhost:9180/metrics?__level=basic
The labels themselves are not reported thanks to label filtering of
labels begin with __.
The feature labels:
__cdc, __cas and __alternator can be an easy way to disable a set of
metrics when not using a feature.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Clean up the code by using direct return of future for `run_op_with_retry`.
This can be done as the `run_op_with_retry` function is already returning
a future that we can reuse directly. What needs to be taken care of is
to not use temporaries referenced from inside the lambda passed to the
`run_op_with_retry`.
Allow setting the voters and non-voters in a single operation. This
ensures that the configuration changes are done atomically.
In particular, we don't want to set voters and non-voters separately
because it could lead to inconsistencies or even the loss of quorum.
This change also partially reverts the commit 115005d, as we will only
need the convenience wrappers for removing the voters (not for adding
them).
Refs: scylladb/scylladb#18793
For the limited voters feature to work properly we need to make sure
that we are only managing the voter status through the topology
coordinator. This means that we should not change the node votership
from the storage_service module for the raft topology directly.
This needs to be done in addition to dropping of the votership change
from the storage_service module.
The `remove_from_raft_config` is redundant and can be removed because
a successfully completed `removenode` operation implies that the node
has been removed from group 0 by the topology coordinator.
Refs: scylladb/scylladb#22860
Refs: scylladb/scylladb#18793
Refs: scylladb/scylladb#21969
For the limited voters feature to work properly we need to make sure
that we are only managing the voter status through the topology
coordinator. This means that we should not change the node votership
from the storage_service module for the raft topology directly.
We can drop the voter status changes from the storage_service module
because the topology coordinator will handle the votership changes
eventually. The calls in the storage_service module were not essential
and were only used for optimization (improving the HA under certain
conditions).
This has effect on the timing in the tablets migration test though,
as it relied on the node being made non-voter from the service_storage
`raft_removenode()` function. The fix is to add another server to the
topology to make sure we will keep the quorum.
Previously the test worked because the test waits for an injection to be
reached and it was ensured that the injection (log line) has only been
triggered after the node has been made non-voter from the
`raft_removenode()`. This is not the case anymore. An alternative fix
would be to wait for the first node to be made non-voter before stopping
the second server, but this would make the test more complex (and it is
not strictly required to only use 4 servers in the test, it has been
only done for optimization purposes).
Fixes: scylladb/scylladb#22860
Refs: scylladb/scylladb#18793
Refs: scylladb/scylladb#21969
This is continuation of #21533
There are two almost identical helpers in api/ -- validate_table(ks, cf) and get_uuid(ks, cf). Both check if the ks:cf table exists, throwing bad_param_exception if it doesn't. There's slight difference in their usage, namely -- callers of the latter one get the table_id found and make use of it, while the former helper is void and its callers need to re-search for the uuid again if the need (spoiler: they do).
This PR merges two helpers together, so there's less code to maintain. As a nice side effect, the existing validate_table() callers save one re-lookup of the ks:cf pair in database mappings.
Affected endpoints are validated by existing tests:
* column_family/{autocompation|tombstone_gc|compaction_strategy}, validated by the tests described in #21533
* /storage_service/{range_to_endpoint_map|describe_ring|ownership}, validated by nodetool tests
* /storage_service/tablets/{move|repair}, validated by tablets move and repair tests
Closesscylladb/scylladb#22742
* github.com:scylladb/scylladb:
api: Remove get_uuid() local helper
api: Make use of validate_table()'s table_id
api: Make validate_table() helper return table_id after validation
api: Change validate_table()'s ctx argument to database
Previously, variables were marked as const, causing std::move() calls to
be redundant as reported by GCC warnings. This change either removes
const qualifiers or marks related lambdas as mutable, allowing the
compiler to properly utilize move constructors for better performance.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23066
As a part of the moving to bare pytest we need to extract the required test
environment preparation steps into pytest's hooks/fixtures.
Do this for S3 mock stuff (MinioServer, MockS3Server, and S3ProxyServer)
and for directories with test artifacts.
For compatibility reason add --test-py-init CLI option for bare pytest
test runner: need to add it to pytest command if you need test.py
stuff in your tests (boost, topology, etc.)
Also, postpone initialization of TestSuite.artifacts and TestSuite.hosts
from import-time to runtime.
Closesscylladb/scylladb#23087
Fix GCC warning about moving from a const reference in mp_row_consumer_k_l::flush_if_needed.
Since position_in_partition::key() returns a const reference, std::move has no effect.
Considered adding an rvalue reference overload (clustering_key_prefix&& key() &&) but
since the "me" sstable format is mandatory since 63b266e9, this approach offers no benefit.
This change simply removes the redundant std::move() call to silence the warning and
improve code clarity.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23085
Currently for system keyspace part of config members are configured outside of this helper, in the caller. It's more consistent to have full config initialization in one place.
Closesscylladb/scylladb#22975
* github.com:scylladb/scylladb:
replica: Mark database::make_keyspace_config() private
replica: Prepare full keyspace config in make_keyspace_config()
Enhance how the script handles remote repository selection for a given
SHA1 commit hash.
Previously, in 3bdbe620, the script fetched from all remotes containing
the product name, which could lead to inefficiencies and errors,
especially with multiple matching remotes. Now, it first checks if the
SHA1 is in any local remote-tracking branch, using that remote if found,
and otherwise fetches from each remote sequentially to find the first
one containing the SHA1. This approach minimizes unnecessary fetches,
making the script more efficient for debugging coredumps in repositories
with multiple remotes.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23026
LIMIT and PER PARTITION LIMIT limit the number of rows returned or taken
into consideration by a query. It makes no logical sense to have this
value at less than 1. Cassandra also has this requirement.
This patch ensures that the limit value is strictly positive and adds
an explicit test for it - it was only tested in a test ported from
Cassandra, that is disabled due to other issues.
Closesscylladb/scylladb#23013
If hosts and/or dcs filters are specified for tablet repair and
some replicas match these filters, choose the replica that will
be the repair master according to round-robin principle
(currently it's always the first replica).
If hosts and/or dcs filters are specified for tablet repair and
no replica matches these filters, the repair succeeds and
the repair request is removed (currently an exception is thrown
and tablet repair scheduler reschedules the repair forever).
Fixes: https://github.com/scylladb/scylladb/issues/23100.
Needs backport to 2025.1 that introduces hosts and dcs filters for tablet repair
Closesscylladb/scylladb#23101
* github.com:scylladb/scylladb:
test: add new cases to tablet_repair tests
test: extract repiar check to function
locator: add round-robin selection of filtered replicas
locator: add tablet_task_info::selected_by_filters
service: finish repair successfully if no matching replica found
This commit adds the upgrade guides relevant in version 2025.1:
- From 6.2 to 2025.1
- From 2024.x to 2025.1
It also removes the upgrade guides that are not relevant in 2025.1 source available:
- Open Source upgrade guides
- From Open Source to Enterprise upgrade guides
- Links to the Enterprise upgrade guides
Also, as part of this PR, the remaining relevant content has been moved to
the new About Upgrade page.
WHAT NEEDS TO BE REVIEWED
- Review the instructions in the 6.2-to-2025.1 guide
- Review the instructions in the 2024.x-to-2025.1 guide
- Verify that there are no references to Open Source and Enterprise.
The scope of this PR does not have to include metrics - the info can be added
in a follow-up PR.
Fixes https://github.com/scylladb/scylladb/issues/22208
Fixes https://github.com/scylladb/scylladb/issues/22209
Fixes https://github.com/scylladb/scylladb/issues/23072
Fixes https://github.com/scylladb/scylladb/issues/22346Closesscylladb/scylladb#22352
Previously, when result_generator's default constructor was called, the
_stats member variable remained uninitialized. This could lead to
undefined behavior in release builds where uninitialized values are
unpredictable, making issues difficult to debug.
This change initializes the pointer to nullptr, ensuring consistent
behavior across all build types and preventing potential memory-related
bugs.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23073
When implementing the copy constructor for `sstable_set` (derived from
`enable_lw_shared_from_this`), we intentionally need the parent's default
constructor rather than its copy constructor. This is because each new
`sstable_set` instance maintains its own reference count and owns a clone
of the source object's implementation (`x._impl->clone()`).
Although this behavior is correct, GCC warns about not calling the parent's
copy constructor. This change explicitly calls the parent's default constructor
to:
1. Silence GCC warnings
2. Clearly document our intention to use the default constructor
3. Follow best practices for constructor initialization
The functionality remains unchanged, but the code is now more explicit about
its design and free of compiler warnings.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23083
If hosts and/or dcs filters are specified for tablet repair and
no replica matches these filters, an exception is thrown. The repair
fails and tablet repair scheduler reschedules it forever.
Such a repair should actually succeed (as all specified relpicas were
repaired) and the repair request should be removed.
Treat the repair as successful if the filters were specified and
selected no replica.
It is possible that the permit handed in to register_inactive_read() is
already aborted (currently only possible if permit timed out).
If the permit also happens to have wait for memory, the current code
will attempt to call promise<>::set_exception() on the permit's promise
to abort its waiters. But if the permit was already aborted via timeout,
this promise will already have an exception and this will trigger an
assert. Add a separate case for checking if the permit is aborted
already. If so, treat it as immediate eviction: close the reader and
clean up.
Fixes: scylladb/scylladb#22919
Unless the test in question actually wants to test timeouts. Timeouts
will have more pronounced consequences soon and thus using
db::timeout_clock::now() becomes a sure way to make tests flaky.
To avoid this, use db::no_timeout in the tests that don't care about
timeouts.
This commit adds a link to the Limitations section on the Tablets page
to the CQL pag, the tablets option.
This is actually the place where the user will need the information:
when creating a keyspace.
In addition, I've reorganized the section for better readability
(otherwise, the section about limitations was easy to miss)
and moved the section up on the page.
Note that I've removed the updated content from the `_common` folder
(which I deleted) to the .rst page - we no longer split OSS and Enterprise,
so there's no need to keep using the `scylladb_include_flag` directive
to include OSS- and Ent-specific content.
Fixes https://github.com/scylladb/scylladb/issues/22892
Fixes https://github.com/scylladb/scylladb/issues/22940Closesscylladb/scylladb#22939
Previously, the clang-tidy.yaml workflow would cancel the clang-tidy job
when a comment wasn't prefixed with "/clang-tidy", instead of skipping it.
This cancellation triggered unnecessary email notifications for developers
with GitHub action notifications enabled.
This change modifies the workflow to only run clang-tidy when the
read-toolchain job succeeds, reducing notification noise by properly
skipping the job rather than cancelling it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23084
after introducing the test.py subfolders support,
test.py start creating weird log files like
testlog/topology_custom.mv/tablets/test_mv_tablets.1
that affect failed test collection logic
this commit fixes this and test.py logs as previously in testlog directory
without any subfolders: topology_custom.mv_tablets_test_mv_tablets.1
Closesscylladb/scylladb#23009
Fix a bug where std::same_as<...> constraint was incorrectly used as a
simple requirement instead of a nested requirement or part of a
conjunction. This caused the constraint to be always satisfied
regardless of the actual types involved.
This change promotes std::same_as<...> to a top-level constraint,
ensuring proper type checking while improving code readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23068
Replace boost::accumulate() calls with std::ranges facilities. This
change reduces external dependencies and modernizes the codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23062
The tree code have const and non-const overloads for searching methods
like find(), lower_bound(), etc. Not to implement them twice, it's coded
like
const_iterator find() const {
... // the implementation itself
}
iterator find() {
return iterator(const_cast<const *>(this)->find());
}
i.e. -- const overload is called, and returned by it const_iterator is
converted into a non-const iterator. For that the latter has dedicated
constructor with two inaccuracies: it's not marked as explicit and it
accepts const rvalue reference.
This patch fixes both.
Althogh this disables implicit const -> non-const conversion of
iterators, the constructor in question is public, which still opens a
way for conversion (without const_cast<>). This constructor is better
be marked private, but there's double_decker class that uses bptree
and exploits the same hacks in its finding methods, so it needs this
constructor to be callable. Alas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#23069
Replace value-based exception catching with reference-based catching to address
GCC warnings about polymorphic type slicing:
```
warning: catching polymorphic type ‘class seastar::rpc::stream_closed’ by value [-Wcatch-value=]
```
When catching polymorphic exceptions by value, the C++ runtime copies the
thrown exception into a new instance of the specified type, slicing the
actual exception and potentially losing important information. This change
ensures all polymorphic exceptions are caught by reference to preserve the
complete exception state.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23064
If user fails to supply the AttributeDefinitions parameter when creating
a table, Scylla used to fail on RAPIDJSON_ASSERT. Now it calls a polite
exception, which is fully in-line with what DynamoDB does.
The commit supplies also a new, relevant test routine.
Fixes#23043Closesscylladb/scylladb#23041
Fixes#22314
Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them.
Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points.
Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line.
Closesscylladb/scylladb#22327
* github.com:scylladb/scylladb:
tools: Add standard extensions and propagate to schema load
cql_test_env: Use add all extensions instead of inidividually
main: Move extensions adding to function
tomstone_gc: Make validate work for tools
class clustering_range is a range of Clustering Key Prefixes implemented
as interval<clustering_key_prefix>. However, due to the nature of
Clustering Key Prefix, the ordering of clustering_range is complex and
does not satisfy the invariant of interval<>. To be more specific, as a
comment in interval<> implementation states: “The end bound can never be
smaller than the start bound”. As a range of CKP violates the invariant,
some algorithms, like intersection(), can return incorrect results.
For more details refer to scylladb#8157, scylladb#21604, scylladb#22817.
This commit:
- Add a WARNING comment to discourage usage of clustering_range
- Add WARNING comments to potentially incorrect uses of
interval<clustering_key_prefix> non-trivial methods
- Add a FIXME comment to incorrect use of
interval<clustering_key_prefix_view>::deoverlap and WARNING comments
to related interval<clustering_key_prefix_view> misuse.
Closesscylladb/scylladb#22913
Currently, when we add servers to the cluster in the test, we use
a 60s timeout which proved to be not enough in one of the debug runs.
There is no reason for this test to use a shorter timeout than all
the other tests, so in this patch we reset it to the higher default.
Fixes https://github.com/scylladb/scylladb/issues/23047Closesscylladb/scylladb#23048
Currently for system keyspace part of config members are configured
outside of this helper, in the caller. It's more consistent to have full
config initialization in one place.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Drop it from files that obviously don't need it. Also kill some forward
declarations while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#22979
The test `test_mv_topology_change` is a regression test for
scylladb/scylladb#19529. The problem was that CL=ANY writes issued when
all replicas were down would be kept in memory until the timeout. In
particular, MV updates are CL=ANY writes and have a 5 minute timeout.
When doing topology operations for vnodes or when migrating tablet
replicas, the cluster goes through stages where the replica sets for
writes undergo changes, and the writes started with the old replica set
need to be drained first.
Because of the aforementioned MV updates, the removenode operation could
be delayed by 5 minutes or more. Therefore, the
`test_mv_topology_change` test uses a short timeout for the removenode
operation, i.e. 30s. Apparently, this is too low for the debug mode and
the test has been observed to time out even though the removenode
operation is progressing fine.
Increase the timeout to 60s. This is the lowest timeout for the
removenode operation that we currently use among the in-repo tests, and
is lower than 5 minutes so the test will still serve its purpose.
Fixes: scylladb/scylladb#22953Closesscylladb/scylladb#22958
While generally better to reduce inline code, here we get
rid of the clustering_interval_set.hh dependency, which in turns
depends on boost interval_set, a large dependency.
incremental_compaction_test.cc is adjusted for a missing header.
Closesscylladb/scylladb#22957
Refs scylla-enterprise#5185
Fixes#22901
If a tls socket gets EPIPE the error is not translated to a specific
gnutls error code, but only a generic ERROR_PULL/PUSH. Since we treat
EPIPE as ignorable for plain sockets, we need to unwind nested exception
here to detect that the error was in fact due to this, so we can suppress
log output for this.
Closesscylladb/scylladb#22888
This commit eliminates unused boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22997
When a backup upload is aborted due to instance shutdown, change the log
level from ERROR to INFO since this is expected behavior. Previously,
`abort_requested_exception` during upload would trigger an ERROR log, causing
test failures since error logs indicate unexpected issues.
This change:
- Catches `abort_requested_exception` specifically during file uploads
- Logs these shutdown-triggered aborts at INFO level instead of ERROR
- Aligns with how `abort_requested_exception` is handled elsewhere in the service
This prevents false test failures while still informing administrators
about aborted uploads during shutdown.
Fixesscylladb/scylladb#22391
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22995
This series achieves two things:
1) changes default number of tablet replicas per shard to be 10 in order to reduce load imbalance between shards
This will result in new tables having at least 10 tablet replicas per
shard by default.
We want this to reduce tablet load imbalance due to differences in
tablet count per shard, where some shards have 1 tablet and some
shards have 2 tablets. With higher tablet count per shard, this
difference-by-one is less relevant.
Fixes https://github.com/scylladb/scylladb/issues/21967
2) introduces a global goal for tablet replica count per shard and adds logic to tablet scheduler to respect it by controlling per-table tablet count
The per-shard goal is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.
There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.
If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.
The scaling is applied after computing desired tablet count due to
all other factors: per-table tablet count hints, defaults, average tablet size.
If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.
When creating a new table, its tablet count is determined by tablet
scheduler using the scheduler logic, as if the table was already created.
So any scaling due to per-shard tablet count goal is reflected immediately
when creating a table. It may however still take some time for the system
to shrink existing tables. We don't reject requests to create new tables.
Fixes#21458Closesscylladb/scylladb#22522
* github.com:scylladb/scylladb:
config, tablets: Allow tablets_initial_scale_factor to be a fraction
test: tablets_test: Test scaling when creating lots of tables
test: tablets_test: Test tablet count changes on per-table option and config changes
test: tablets_test: Add support for auto-split mode
test: cql_test_env: Expose db config
config: Make tablets_initial_scale_factor live-updateable
tablets: load_balancer: Pick initial_scale_factor from config
tablets, load_balancer: Fix and improve logging of resize decisions
tablets, load_balancer: Log reason for target tablet count
tablets: load_balancer: Move hints processing to tablet scheduler
tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table
tablets: load_balancer: Determine desired count from size separately from count from options
tablets: load_balancer: Determine resize decision from target tablet count
tablets: load_balancer: Allow splits even if table stats not available
tablets: load_balancer: Extract make_sizing_plan()
tablets: Add formatter for resize_decision::way_type
tablets: load_balancer: Simplify resize_urgency_cmp()
tablets: load_balancer: Keep config items as instance members
locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology()
tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard
tablets: Set default initial tablet count scale to 10
tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology()
tablets: load_balancer: Extract get_schema_and_rs()
tablets: load_balancer: Drop test_mode
Altering a keyspace (that has tablets enabled) without changing
tablets attributes, i.e. no `AND tablets = {...}` results in incorrect
"Update Keyspace..." log message being printed. The printed log
contains "tablets={"enabled":false}".
Refs https://github.com/scylladb/scylladb/issues/22261Closesscylladb/scylladb#22324
Tablet sequeunce number was part of the tablet identifier together
with last token, so on split and merge all ids changed and it appeared
in the simulator as all tablets of a table dropping and being created
anew. That's confusing. After this change, only last token is part of
the id, so split appears as adding tablets and merge appears as
removing half the tablets, which is more accurate.
Also includes an enhancement to make showing of tablet id text
optional in table mode.
Closesscylladb/scylladb#22981
* github.com:scylladb/scylladb:
tablet-mon.py: Don't show merges and splits as full table recreations
tablet-mon.py: Add toggle for tablet ids
We currently depends on hostname command to get local IP, but we can do
this on Python API.
After the change, we can drop the package.
Closesscylladb/scylladb#22909
This commit removes the OSS version name, which is irrelevant
and confusing for 2025.1 and later users.
Also, it updates the warning to avoid specifying the release
when the deprecated feature will be removed.
Fixes https://github.com/scylladb/scylladb/issues/22839Closesscylladb/scylladb#22936
There's a sstable_directory::create_pending_deletion_log() helper method that's called by sstable's filesystem_storage atomic-delete methods and that prepares the deletion log for a bunch of sstables. For that method to do its job it needs to get private sstable->_storage field (which is always the filesystem_storage one), also the atomic-delete transparent context object is leaked into the sstable_directory code and low-level sstable storage code needs to include higher-level sstable_directory header.
This patch unties these knots. As the result:
- friendship between sstable and sstable_directory is removed
- transparent atomic_delete_context is encapsulated in storage.(cc|hh) code
- less code for create_pending_deletion_log() to dump TOC filename into log
Closesscylladb/scylladb#22823
* github.com:scylladb/scylladb:
sstable: Unfriend sstable_directory class
sstable_directory: Move sstable_directory::pending_delete_result
sstable_directory: Calculate prefixes outside of create_pending_deletion_log()
sstable_directory: Introduce local pending_delete_log variable
sstable_directory: Relax toc file dumping to deletion log
Before this patch we silently allowed and ignored PER PARTITION LIMIT.
SELECT DISTINCT requires all the partition key columns, which means that
setting PER PARTITION LIMIT is redundant - only one result will be
returned from every partition anyway.
Cassandra behaves the same way, so this patch also ensures
compatibility.
Fixesscylladb/scylladb#15109Closesscylladb/scylladb#22950
pip_packages is an associative array, which in bash is constructed
as ([key]=value...). In our case the value is often empty (indicating
no version constraint). Shellcheck warns against it, since `[key]= x`
could be a mistype of `[key]=x`. It's not in our case, but shellcheck
doesn't know that.
Make shellcheck happier by specifying the empty values explicitly.
Closesscylladb/scylladb#22990
Add verbose logging to identify failing test combinations in multi-DC
setup:
- Log replication factor (RF) and consistency level (CL) for each test
iteration
- Add validation checks for empty result sets
Improve error handling:
- Before indexing in a list, use `assert` to check for its emptiness
- Use assertion failures instead of exceptions for clearer test diagnostics
This change helps debug test failures by showing which RF/CL
combinations cause inconsistent results between zero-token and regular
nodes.
Refs scylladb/scylladb#22967
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22968
The test simulates the cluster getting stuck during upgrade to raft
topology due to majority loss, and then verifies that it's possible to
get out of the situation by performing recovery and redoing the upgrade.
Fixes: #17410Closesscylladb/scylladb#17675
* https://github.com/scylladb/scylladb:
test/topology_experimental_raft: add test_topology_upgrade_stuck
test.py: bump minimum python version to 3.11
test.py: move gather_safely to pylib utils
cdc: generation: don't capture token metadata when retrying update
test.py: topology: ignore hosts when waiting for group0 consistency
raft: add error injection that drops append_entries
topology_coordinator: add injection which makes upgrade get stuck
because we don't care about the exact output of grep, let's silence
its output. also, no need to check for the string is empty, so let's
just use the status code of the grep for the return value of the
function, more idiomatic this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22737
In some cases the paused/unpaused node can hang not after 30s timeout.
This make the test flaky. Change the condition to always check the
coordinator's log if there is a hung node.
Add `stop_after_streaming` to the list of error injections which can
cause a node's hang.
Also add a wait for a new coordinator election in cluster events
which cause such elections.
Closesscylladb/scylladb#22825
`slice.is_reversed()` was falsely flagged as accessing moved data, since
the underlying enum_set remains valid after move. However, to improve code
clarity and silence the warning, now reference `command->slice` directly
instead, which is guaranteed to be valid as the move target.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22971
Eliminate one try/catch block around call to wr.close()
by using coroutine::as_future.
Mark error paths as `[[unlikely]]`.
Use `coroutine::return_exception_ptr` to avoid rethrowing
the final exception.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#22831
Before these changes, the script didn't update the listed pip packages
if they were already installed. If the latest version of Scylla started
using new features and required an updated Python driver, for example,
the developers (and possibly the user) were forced to update it manually.
In this commit, we modify the script so that it updates the installed
packages when run. This should make things easier for everyone.
Closesscylladb/scylladb#22912
Tablet sequeunce number was part of the tablet identifier together
with last token, so on split and merge all ids changed and it appeared
in the simulator as all tablets of a table dropping and being created
anew. That's confusing. After this change, only last token is part of
the id, so split appears as adding tablets and merge appears as
removing half the tablets, which is more accurate.
Replace complex boolean expression:
```py
not driver_response_future.has_more_pages or not all_pages
```
with clearer equivalent:
```py
driver_response_future.has_more_pages and all_pages
```
The new expression is more intuitive as it directly checks for both
conditions (having more pages and wanting all pages) rather than using double
negation.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22969
Before the limited voters feature, the "raft_ignore_nodes" test was
relying upon the fact that all nodes will become voters.
With the limited voters feature, the test needs to be adjusted to
ensure that we do not lose the majority of the cluster. This could
happen when there are 7 nodes, but only 5 of them are voters - then if
we kill 3 nodes randomly we might end up with only 2 voters left.
Therefore we need to ensure that we only stop the appropriate number of
voter nodes. So we need to determine which nodes became voters and which
ones are non-voters, and select the nodes to be stopped based on that.
That means with 7 nodes and 5 voters, we can stop up to 2 voter nodes,
but at least one of the stopped nodes must be a non-voter.
Fixes: scylladb/scylladb#22902
Refs: scylladb/scylladb#18793
Refs: scylladb/scylladb#21969Closesscylladb/scylladb#22904
Now that we support suite subfolders, there is no
need to create an own suite for topology_tasks and topology_random_failures.
Closesscylladb/scylladb#22879
* https://github.com/scylladb/scylladb:
test.py: merge topology_tasks suite into topology_custom suite
test.py: merge topology_random_failures suite into topology_customs
In the current scenario, the problem discovered is that there is a time
gap between group0 creation and raft_initialize_discovery_leader call.
Because of that, the group0 snapshot/apply entry enters wrong values
from the disk(null) and updates the in-memory variables to wrong values.
During the above time gap, the in-memory variables have wrong values and
perform absurd actions.
This PR removes the variable `_manage_topology_change_kind_from_group0`
which was used earlier as a work around for correctly handling
`topology_change_kind` variable, it was brittle and had some bugs
(causing issues like scylladb/scylladb#21114). The reason for this bug
that _manage_topology_change_kind used to block reading from disk and
was enabled after group0 initialization and starting raft server for the
restart case. Similarly, it was hard to manage `topology_change_kind`
using `_manage_topology_change_kind_from_group0` correctly in bug free
manner.
Post `_manage_topology_change_kind_from_group0` removal, careful
management of `topology_change_kind` variable was needed for maintaining
correct `topology_change_kind` in all scenarios. So this PR also
performs a refactoring to populate all init data to system tables even
before group0 creation(via `raft_initialize_discovery_leader` function).
Now because `raft_initialize_discovery_leader` happens before the group
0 creation, we write mutations directly to system tables instead of a
group 0 command. Hence, post group0 creation, the node can read the
correct values from system tables and correct values are maintained
throughout.
Added a new function `initialize_done_topology_upgrade_state` which
takes care of updating the correct upgrade state to system tables before
starting group0 server. This ensures that the node can read the correct
values from system tables and correct values are maintained throughout.
By moving `raft_initialize_discovery_leader` logic to happen before
starting group0 server, and not as group0 command post server start, we
also get rid of the potential problem of init group0 command not being
the 1st command on the server. Hence ensuring full integrity as expected
by programmer.
This PR fixes a bug. Hence we need to backport it.
Fixes: scylladb/scylladb#21114Closesscylladb/scylladb#22484
* https://github.com/scylladb/scylladb:
storage_service: Remove the variable _manage_topology_change_kind_from_group0
storage_service: fix indentation after the previous commit
raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
service/raft: Refactor mutation writing helper functions.
Currently, maybe_switch_to_new_writer resets _current_writer
only in a continuation after closing the current writer.
This leaves a window of vulnerability if close() yields,
and token_group_based_splitting_mutation_writer::close()
is called. Seeing the engaged _current_writer, close()
will call _current_writer->close() - which must be called
exactly once.
Solve this when switching to a new writer by resetting
_current_writer before closing it and potentially yielding.
Fixes#22715
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#22922
Replace boost::range::find() calls with std::ranges::find(). This
change reduces external dependencies and modernizes the codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22942
The split monitor wasn't handling the scenario where the table being
split is dropped. The monitor would be unable to find the tablet map
of such a table, and the error would be treated as a retryable one
causing the monitor to fall into an endless retry loop, with sleeps
in between. And that would block further splits, since the monitor
would be busy with the retries. The fix is about detecting table
was dropped and skipping to the next candidate, if any.
Fixes#21859.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#22933
This PR improves and refactors the test.topology.util new_test_keyspace generator
and adds a corresponding create_new_test_keyspace function to be used by most if not
all topology unit tests in order to standardize the way the tests create keyspaces
and to mitigate the python driver create keyspace retry issue: https://github.com/scylladb/python-driver/issues/317Fixes#22342Fixes#21905
Refs https://github.com/scylladb/scylla-enterprise/issues/5060
* No backport required, though may be desired to stabilize CI also in release branches.
Closesscylladb/scylladb#22399
* github.com:scylladb/scylladb:
test_tablet_repair_scheduler: prepare_multi_dc_repair: use create_new_test_keyspace
test/repair: create_table_insert_data_for_repair: create keyspace with unique name
topology_tasks/test_tablet_tasks: use new_test_keyspace
topology_tasks/test_node_ops_tasks: use new_test_keyspace
topology_custom/test_zero_token_nodes_no_replication: use create_new_test_keyspace
topology_custom/test_zero_token_nodes_multidc: use create_new_test_keyspace
topology_custom/test_view_build_status: use new_test_keyspace
topology_custom/test_truncate_with_tablets: use new_test_keyspace
topology_custom/test_topology_failure_recovery: use new_test_keyspace
topology_custom/test_tablets_removenode: use create_new_test_keyspace
topology_custom/test_tablets_migration: use new_test_keyspace
topology_custom/test_tablets_merge: use new_test_keyspace
topology_custom/test_tablets_intranode: use new_test_keyspace
topology_custom/test_tablets_cql: use new_test_keyspace
topology_custom/test_tablets2: use *new_test_keyspace
topology_custom/test_tablets2: test_schema_change_during_cleanup: drop unused check function
topology_custom/test_tablets: use new_test_keyspace
topology_custom/test_table_desc_read_barrier: use new_test_keyspace
topology_custom/test_shutdown_hang: use new_test_keyspace
topology_custom/test_select_from_mutation_fragments: use new_test_keyspace
topology_custom/test_rpc_compression: use new_test_keyspace
topology_custom/test_reversed_queries_during_simulated_upgrade_process: use new_test_keyspace
topology_custom/test_raft_snapshot_truncation: use create_new_test_keyspace
topology_custom/test_raft_no_quorum: use new_test_keyspace
topology_custom/test_raft_fix_broken_snapshot: use new_test_keyspace
topology_custom/test_query_rebounce: use new_test_keyspace
topology_custom/test_not_enough_token_owners: use new_test_keyspace
topology_custom/test_node_shutdown_waits_for_pending_requests: use new_test_keyspace
topology_custom/test_node_isolation: use create_new_test_keyspace
topology_custom/test_mv_topology_change: use new_test_keyspace
topology_custom/test_mv_tablets_replace: use new_test_keyspace
topology_custom/test_mv_tablets_empty_ip: use new_test_keyspace
topology_custom/test_mv_tablets: use new_test_keyspace
topology_custom/test_mv_read_concurrency: use new_test_keyspace
topology_custom/test_mv_fail_building: use new_test_keyspace
topology_custom/test_mv_delete_partitions: use new_test_keyspace
topology_custom/test_mv_building: use new_test_keyspace
topology_custom/test_mv_backlog: use new_test_keyspace
topology_custom/test_mv_admission_control: use new_test_keyspace
topology_custom/test_major_compaction: use new_test_keyspace
topology_custom/test_maintenance_mode: use new_test_keyspace
topology_custom/test_lwt_semaphore: use new_test_keyspace
topology_custom/test_ip_mappings: use new_test_keyspace
topology_custom/test_hints: use new_test_keyspace
topology_custom/test_group0_schema_versioning: use new_test_keyspace
topology_custom/test_data_resurrection_after_cleanup: use new_test_keyspace
topology_custom/test_read_repair_with_conflicting_hash_keys: use new_test_keyspace
topology_custom/test_read_repair: use new_test_keyspace
topology_custom/test_compacting_reader_tombstone_gc_with_data_in_memtable: use new_test_keyspace
topology_custom/test_commitlog_segment_data_resurrection: use new_test_keyspace
topology_custom/test_change_replication_factor_1_to_0: use new_test_keyspace
topology/test_tls: test_upgrade_to_ssl: use new_test_keyspace
test/topology/util: new_test_keyspace: drop keyspace only on success
test/topology/util: refactor new_test_keyspace
test/topology/util: CREATE KEYSPACE IF NOT EXISTS
test/topology/util: new_test_keyspace: accept ManagerClient
Replace boost::accumulate() calls with std::ranges::fold_left(). This
change reduces external dependencies and modernizes the codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22924
Replace boost::find() calls with std::ranges::find() and std::ranges::contains()
to leverage modern C++ standard library features. This change reduces external
dependencies and modernizes the codebase.
The following changes were made:
- Replaced boost::find() with std::ranges::find() where index/iterator is needed
- Used std::ranges::contains() for simple element presence checks
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22920
rebalance_tablets() was performing migrations and merges automatically
but not splits, because splits need to be acked by replicas via
load_stats. It's inconvenient in tests which want to rebalance to the
equilibrium point. This patch changes rebalance_tablets() to split
automatically by default, can be disabled for tests which expect
differently.
shared_load_stats was introduced to provide a stable holder of
load_stats which can be reused across rebalance_tablets() calls.
Resize is no longer only due to avg tablet size. Log avg tablet size as an
information, not the reason, and log the true reason for target tablet
count.
Hints have common meaning for all strategies, so the logic
belongs more to make_sizing_plan().
As a side effect, we can reuse shard capacity computation across
tables, which reduces computational complexity from O(tables*nodes) to
O(tables * DCs + nodes)
The limit is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.
There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.
If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.
If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.
The limit is configurable. It's a global per-cluster config which
controls how many tablet replicas per shard in total we consider to be
still ok. It controls tablet allocator behavior, when choosing initial
tablet count. Even though it's a per-node config, we don't support
different limits per node. All nodes must have the same value of that
config. It's similar in that regard to other scheduler config items
like tablets_initial_scale_factor and target_tablet_size_in_bytes.
This makes decisions made by the scheduler consistent with decisions
made on table creation, with regard to tablet count.
We want to avoid over-allocation of tablets when table is created,
which would then be reduced by the scheduler's scaling logic. Not just
to avoid wasteful migrations post table creation, but to respect the
per-shard goal. To respect the per-shard goal, the algorithm will no
longer be as simple as looking at hints, and we want to share the
algorithm between the scheduler and initial tablet allocator. So
invoke the scheduler to get the tablet count when table is created.
This is in preparation for using the sizing plan during table creation
where we never have size stats, and hints are the only determining
factor for target tablet count.
Resize plan making will now happen in two stages:
1) Determine desired tablet counts per table (sizing plan)
2) Schedule resize decisions
We need intermediate step in the resize plan making, which gives us
the planned tablet counts, so that we can plug this part of the
algorithm into initial tablet allocation on table construction.
We want decisisons made by the scheduler to be consistent with
decisions made on table creation. We want to avoid over-allocation of
tablets when table is created, which would then be reduced by the
scheduler. Not just to avoid wasteful migrations post table creation,
but to respect the per-shard goal. To respect the per-shard goal, the
algorithm will no longer be as simple as looking at hints, and we want
to share the algorithm between the scheduler and initial tablet
allocator.
Also, this sizing plan will be later plugged into a virtual table for
observability.
Logic is preserved since target tablet size is constant for all
tables.
Dropping d.target_max_tablet_size() will allow us to move it
to the load_balancer scope.
Currently the scale is applied post rounding up of tablet count so
that tablet count per shard is at least 1. In order to be able to use
the scale to increase tablet count per shard, we need to apply it
prior to division by RF, otherwise we will overshoot per-shard tablet
replica count.
Example:
4 nodes, -c1, rf=3, initial_tablets_scale=10
Before: initial_tablet_count=20, tablet-per-shard=15
After: initial_tablet_count=14, tablets-per-shard=10.5
This will result in new tables having at least 10 tablet replicas per
shard by default.
We want this to reduce tablet load imbalance due to differences in
tablet count per shard, where some shards have 1 tablet and some
shards have 2 tablets. With higher tablet count per shard, this
difference-by-one is less relevant.
Fixes#21967
In some tests, we explicity set the initial scale to 1 as some of the
existing tests assume 1 compaction group per shard.
test.py uses a lower default. Having many tablets per shard slows down
certain topology operations like decommission/replace/removenode,
where the running time is proportional to tablet count, not data size,
because constant cost (latency) of migration dominates. This latency
is due to group0 operations and barriers. This is especially
pronounced in debug mode. Scheduler allows at most 2 migrations per
shard, so this latency becomes a determining factor for decommission
speed.
To avoid this problem in tests, we use lower default for tablet count per
shard, 2 in debug/dev mode and 4 in release mode. Alternatively, we
could compensate by allowing more concurrency when migrating small
tablets, but there's no infrastructure for that yet.
I observed that with 10 tablets per shard, debug-mode
topology_custom.mv/test_mv_topology_change starts to time-out during
removenode (30 s).
wrapped writer in seastar::futurize_invoke to make sure that the close() for the mutation_reader can be executed before destruction.
Fixes#22790Closesscylladb/scylladb#22812
It was only needed there for create_pending_deletion_log() method to get
private "_storage" from sstable. Now it's all gone and friendship can be
broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
... to where it belongs -- to the filesystem storage driver itself.
Continuation of the previous patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question walks the list of sstables and accumulates
sstables' prefixes into a set on pending_delete_result object. The set
in question is not used at all in this method and is in fact alien to it
-- the p.d._result object is used by the filesystem storage driver as
atomic deletion prepare/commit transparent context.
Said that, move the whole pending_delete_result to where it belongs and
relax the create_pending_deletion_log() to only return the log
directory path string.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The current code takes sstable prefix() (e.g. the /foo/bar string), then
trims from its fron the basedir (e.g. the /foo/ string) and then writes
the remainder, a slash and TOC component name (e.g. the xxx-TOC.txt
string). The final result is "bar/xxx-TOC.txt" string.
The taking into account sstable.toc_filename() renders into
sstable.prefix + \slash + component-name, the above result can be
achieved by trimming basedir directory from toc_filename().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lambda which dumps the diagnostics for each semaphore, is static.
Considering that said lambda captures a local (writeln) by reference, this
is wrong on two levels:
* The writeln captured on the shard which happens to initialize this
static, will be used on all shards.
* The writeln captured on the first dump, will be used on later dumps,
possibly triggering a segfault.
Drop the `static` to make the lambda local and resolve this problem.
Fixes: scylladb/scylladb#22756Closesscylladb/scylladb#22776
Recently, when running Alternator tests we get hundreds of warnings like
the following from basically all test files:
/usr/lib/python3.12/site-packages/botocore/crt/auth.py:59:
DeprecationWarning: datetime.datetime.utcnow() is deprecated and
scheduled for removal in a future version. Use timezone-aware objects
to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
/usr/local/lib/python3.12/site-packages/pytest_elk_reporter.py:299:
DeprecationWarning: datetime.datetime.utcnow() is deprecated and
scheduled for removal in a future version. Use timezone-aware objects
to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
These warnings all come from two libraries that we use in the tests -
botocore is used by Alternator tests, and elk reporter is a plugin that
we don't actually use, but it is installed by dtest and we often see
it in our runs as well. These warnings have zero interest to us - not
only do we not care if botocore uses some deprecated Python APIs and
will need to be updated in the future, all these warnings are hiding
*real* warnings about deprecated things we actually use in our own
test code.
The patch modifies test/pytest.ini (used by all our Python tests,
including but not limited to Alternator tests) to ignore deprecation
warnings from *inside* these two libraries, botocore and elk_reporter.
After this patch, test/alternator/run finishes without any warnings
at all. test/cqlpy does still have a few warnings left, which earlier
were hidden by the thousands of spammy warning eliminated in this patch.
We fix one of these warnings in this patch:
ResultSet indexing support will be removed in 4.0.
Consider using ResultSet.one()
by doing exactly what the warning recommended.
Some deprecation warnings in test/cqlpy remain in calls to
get_query_trace(). The "blame" for these warning is misplaced - this
function is part of the cassandra driver, but Python seems to think it's
part of our test code so I can't avoid them with the pytest.ini trick,
I'm not sure why. So I don't know yet how to eliminate these last warnings.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22881
Modify CMake configuration to only apply "-Xclang" options when building
with the Clang compiler. These options are Clang-specific and can cause
errors or warnings when used with other compilers like g++.
This change:
- Adds compiler detection to conditionally apply Clang-specific flags
- Prevents build failures when using non-Clang compilers
Previously, the build system would apply these flags universally, which
could lead to compilation errors with other compilers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22899
Use std::to_underlying() when comparing unsigned types with enumeration values
to fix type mismatch warnings in GCC-14. This specifically addresses an issue in
utils/advanced_rpc_compressor.hh where comparing a uint8_t with 0 triggered a
'-Werror=type-limits' warning:
```
error: comparison is always false due to limited range of data type [-Werror=type-limits]
if (x < 0 || x >= static_cast<underlying>(type::COUNT))
~~^~~
```
Using std::to_underlying() provides clearer type semantics and avoids these kind
of comparison warnings. This change improves code readability while maintaining
the same behavior.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22898
Using the new_test_keyspace fixture is awkward for this test
as it is written to explicitly drop the created keyspaces
at certain points.
Therefore, just use create_new_test_keyspace to standardize the
creation procedure.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
new_test_keyspace is problematic here since
the presence of the banned node can fail the automatic drop of
the test keyspace due to NoHostAvailable (in debug mode for
some reason)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define create_new_test_keyspace that can be used in
cases we cannot automatically drop the newly created keyspace
due to e.g. loss of raft majority at the end of the test.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Following patch will convert topology tests to use
new_test_keyspace and friends.
Some tests restart server and reset the driver connection
so we cannot use the original cql Session for
dropping the created keyspace in the `finally` block.
Pass the ManagerClient instead to get a new cql
session for dropping the keyspace.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, we can not have more than one global topology operation at the same time. This means that we can not have concurrent truncate operations because truncate is implemented as a global topology operation.
Truncate excludes with other topology operations, and has to wait for those to complete before truncate starts executing. This can lead to truncate timeouts. In these cases the client retries the truncate operation, which will check for ongoing global topology operations, and will fail with
an "Another global topology request is ongoing, please retry." error.
This can be avoided by truncate checking if the ongoing global topology operation is a truncate running for the same table who's truncate has just been requested again. In this case, we can wait for the ongoing truncate to complete instead of immediately failing the operation, and
provide a better user experience.
This is an improvement, backport is not needed.
Closes#22166Closesscylladb/scylladb#22371
* github.com:scylladb/scylladb:
test: add test for re-cycling ongoing truncate operations
truncate: add additional logging and improve error message during truncate
storage_proxy: wait on already running truncate for the same table
storage_proxy: allow multiple truncate table fibers per shard
Return after executing the global metadata barrier to allow the topology
handler to handle any transitions that might have started by a
concurrect transaction.
Fixes#22792
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#22793
Refs #22628
Adds exception handler + cleanup for the case where we have a bad config/env vars (hint minio) or similar, such that we fail with exception during setting up the EAR context. In a normal startup, this is ok. We will report the exception, and the do a exit(1).
In tests however, we don't and active context will instead be freed quite proper, in which case we need to call stop to ensure we don't crash on shared pointer destruction on wrong shard. Doing so will hide the real issue from whomever runs the test.
Adds some verbosity to track issues with the network proxy used to test EAR connector difficulties. Also adds an earlier close in input stream to help network usage.
Note: This is a diagnostic helper. Still cannot repro the issue above.
Closesscylladb/scylladb#22810
* github.com:scylladb/scylladb:
gcp/aws kms: Promote service_error to recoverable + use malformed_response_error
encryption_at_rest_test: Add verbosity + earlier stream close to proxy
encryption: Add exception handler to context init (for tests)
Replace boost::accumulate() with the standard library's alternatives
to reduce external dependencies and simplify the codebase. This
change eliminates the requirement for boost::range and makes the
implementation more maintainable.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22856
Disable seastar's built in handlers for SIGINT and SIGTERM and thus
fall-back to the OS's default handlers, which terminate the process.
This makes tool applications interruptable by SIGINT and SIGTERM.
The default handler just terminates the tool app immediately and
doesn't allow for cleanup, but this is fine: the tools have no important
data to save or any critical cleanup to do before exiting.
Fixes: scylladb/scylladb#16954Closesscylladb/scylladb#22838
The config variable `components_memory_reclaim_threshold` limits the
memory available to the sstable bloom filters. Any change to its value
is not immediately propagated to the sstable manager, despite it being
a LiveUpdate variable. The updated value takes effect only when a new
sstable is created or deleted.
This PR first refactors the reclaim and reload logic into a single
background fiber. It then updates the sstable manager to subscribe to
changes in the `components_memory_reclaim_threshold` configuration value
and immediately triggers the reclaim/reload fiber when a change is
detected.
Fixes#21947
This is an improvement and does not need to be backported.
Closesscylladb/scylladb#22725
* github.com:scylladb/scylladb:
sstables_manager: trigger reclaim/reload on `components_memory_reclaim_threshold` update
sstables_manager: maybe_reclaim_components: yield between iterations
sstables_manager: rename `increment_total_reclaimable_memory_and_maybe_reclaim()`
sstables_manager: move reclaim logic into `components_reclaim_reload_fiber()`
sstables_manager: rename `_sstable_deleted_event` condition variable
sstables_manager: rename `components_reloader_fiber()`
sstables_manager: fix `maybe_reclaim_components()` indentation
sstables_manager: reclaim components memory until usage falls below threshold
sstables_manager: introduce `get_components_memory_reclaim_threshold()`
sstables_manager: extract `maybe_reclaim_components()`
sstables_manager: fix `maybe_reload_components()` indentation
sstables_manager: extract out `maybe_reload_components()`
The config variable `components_memory_reclaim_threshold` limits the
memory available to the sstable bloom filters. Any change to its value
is not immediately propagated to the sstable manager, despite it being
a LiveUpdate variable. The updated value takes effect only when a new
sstable is created or deleted.
This patch updates the sstable manager to subscribe to any changes in
the above mentioned config value and immediately trigger the
reclaim/reload fiber when a change occurs. Also, adds a testcase to
verify the fix.
Fixes#21947
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Refs #22628
Mark problems parsing response (partial message, network error without exception etc
- hello testing), as "malformed_response_error", and promote this as well as
general "service_error" to recoverable exceptions (don't isolate node on error).
This to better handle intermittent network issues as well as making error-testing
more deterministic.
Refs #22628
Adds some verbosity to track issues with the network proxy used to test
EAR connector difficulties. Also adds an earlier close in input stream
to help network usage.
Note: This is a diagnostic helper. Still cannot repro the issue above.
Adds exception handler + cleanup for the case where we have a
bad config/env vars (hint minio) or similar, such that we fail
with exception during setting up the EAR context.
In a normal startup, this is ok. We will report the exception,
and the do a exit(1).
In tests however, we don't and active context will instead be
freed quite proper, in which case we need to call stop to ensure
we don't crash on shared pointer destruction on wrong shard.
Doing so will hide the real issue from whomever runs the test.
Serializing `raft::append_request` for transmission requires approximately the same amount of memory as its size. This means when the Raft library replicates a log item to M servers, the log item is effectively copied M times. To prevent excessive memory usage and potential out-of-memory issues, we limit the total memory consumption of in-flight `raft::append_request` messages.
Fixesscylladb/scylladb#14411Closesscylladb/scylladb#22835
* github.com:scylladb/scylladb:
raft_rpc::send_append_entries: limit memory usage
fms: extract entry_size to log_entry::get_size
Allows querying the content of sstables. Simple queries can be
constructed on the command-line. More advanced queries can be passed in
a file. The output can be text (similar to CQLSH) or json (similar to
SELECT JSON).
Uses a cql_test_env behind the scenes to set-up a query pipeline. The
queried sstables are not registered into cql_test_env, instead they are
queried via the virtual-table interface. This is to isolate the sstables
from any accidental modifications cql_test_env might want to do to them.
tool_app_template::run() calls get_selected_operation() to obtain the
operation (command) the user selected. To do this,
get_selected_operation() does a CLI pre-parsing pass, with a minimal
boost::program_options, so things like mixed positional/non-positional
args are correctly handled. This code use `sstring` for generic
operation-options. The problem is that boost doesn't allow values with
spaces inside for non-std::string types. This therefore prevents such
values from being used for any option downstream, because parsing would
fail at this stage. Change the type to std::string to solve this
problem.
Exposes the RawValue() method of the underlying rapidjson::Writer. This
method allows writing a pre-formatted json value to the stream. This
will allow using cql3/type_json.hh to pre-format CQL3 types, then write
these pre-formatted values into a json stream.
This variant of do_with_cql_env(), forgoes the reentrancy support in the
regular do_with_cql_env() variants, and re-uses the caller's exsting
seastar thread. This is an optimized version for callers which don't
need reentrancy and already have a thread.
The test simulates the cluster getting stuck during upgrade to raft
topology due to majority loss, and then verifies that it's possible
to get out of the situation by performing recovery and redoing the
upgrade.
Fixes: scylladb/scylladb#17410
Python 3.11 introduces asyncio.TaskGroup, which I would like to use in a
test that I'll introduce in the next commit. Modify the python version
check in test.py to prevent from accidentally running with an older
version of python.
The gather_safely function was originally defined in the
test.pylib.scylla_cluster module, but it is a generic concurrency
combinator which is not tied to the concept of Scylla clusters at all.
Move it to test.pylib.util to make this fact more clear.
In legacy topology mode, on startup, a node will attempt to insert data
of the newest CDC generation into the legacy distributed tables. In case
of any errors, the operation will be retried until success in 60s
intervals. While the node waits for the operation to be retried, it
keeps a token_metadata_ptr instance. This is a problem for two reasons:
- The tmptr instance is used in a lambda which determines the cluster
size. This lambda is used to determine the consistency level when
inserting the generation to the distributed tables - if there is only
one node, CL=ONE should be used instead of CL=QUORUM. The tmptr is
immutable so it can technically happen the the cluster is shrinked
while the code waits for the generation to be inserted.
- Token metadata instance keeps a version tracker that which prevents
topology operations from proceeding while the tracker exists. This is
a very niche problem, but it might happen that a leftover instance of
token metadata held by update_streams_description might delay a
topology operation which happens after upgrade to raft topology
happens. This actually slows down the test which simulates upgrade to
raft topology getting stuck (to be introduced in later commits).
Instead of capturing a token_metadata_ptr instance, capture a reference
to shared_token_metadata and use a freshly issued token_metadata_ptr
when computing the cluster size in order to choose the consistency
level.
Now, check_system_topology_and_cdc_generations_v3_consistency has an
additional list argument and will ignore hosts from that list if some of
them are found to be in the "left" state. Additionally, the function now
requires that the set of the live hosts in the cluster is exactly
`live_hosts` - no more, no less.
It will be needed for the test which simulates upgrade procedure getting
stuck - "un-stucking" the procedure requires removing some nodes via
legacy removenode procedure which marks them as "left" in gossip, and
then those nodes might get inserted as "left" nodes into raft topology
by the gossiper orphan remover fiber.
Some of the existing tests had to be adjusted because of the changes:
- test_unpublished_cdc_generations_arent_cleared passed only one of the
cluster's live hosts, now it passes all of them.
- test_topology_recovery_after_majority_loss removes some nodes during
the test, so they need to be put into the ignore_nodes list.
- test_topology_upgrade_basic did not include the last-added node to the
check_system_topology_and_cdc_generations_v3_consistency call, now it
does.
It will be needed for a test that simulates the cluster getting stuck
during upgrade. Specifically, it will be used to simulate network
isolation and to prevent raft commands from reaching that node.
The injection will necessary for the test, introduced in the next
commit, which verifies that it's possible to recover from an upgrade of
raft topology which gets stuck.
Reorder member variable initialization sequence to ensure `pw` is accessed
before being moved. While the current use-after-move warning from clang-tidy
is a false positive, this change:
- Makes the initialization order more logical
- Eliminates misleading static analysis warnings
- Prevents potential future issues if class structure changes
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22830
This commit removes the variable _manage_topology_change_kind_from_group0
which was used earlier as a work around for correctly handling
topology_change_kind variable, it was brittle and had some bugs. Earlier commits
made some modifications to deal with handling topology_change_kind variable
post _manage_topology_change_kind_from_group0 removal
This patch adds to the fetch_scylla.py script, used by the "--release"
option of test/{cqlpy,alternator}/run, the ability to download the new
2025.1 releases.
In the new single-stream releases, the number looks like the old
Scylla Enterprise releases, but the location of the artifacts in the
S3 bucket look like the old open-source releases (without the word
"-enterprise" in the paths). So this patch introduces a new "if"
for the (major >= 2025) case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22778
This change adds two log messages. One for the creation of the truncate
global topology request, and another for the truncate timeout. This is
added in order to help with tracking truncate operation events.
It also extends the "Another global topology request is ongoing, please
retry." error message with more information: keyspace and table name.
Currently, we can not have more than one global topology operation at
the same time. This means that we can not have concurrent truncate
operations because truncate is implemented as a global topology
operation.
Truncate excludes with other topology operations, and has to wait for
those to complete before truncate starts executing. This can lead to
truncate timeouts. In these cases the client retries the truncate operation,
which will check for ongoing global topology operations, and will fail with
an "Another global topology request is ongoing, please retry." error.
This can be avoided by truncate checking if we have a truncate for the same
table already queued. In this case, we can wait for the ongoing truncate to
complete instead of immediatelly failing the operation, and provide a better
user experience.
Both, repair and streaming depend on view builder, but since the builder is started too late, both keep sharded<> reference on it and apply `if (view_builder.local_is_initialized())` safety checks.
However, view builder can do its sharded start much earlier, there's currently nothing that prevents it from doing so. This PR moves view builder start up together with some other of its dependencies, and relaxes the way repair and streaming use their view-builder references, in particular -- removes those ugly initialization checks.
refs: scylladb/scylladb#2737Closesscylladb/scylladb#22676
* github.com:scylladb/scylladb:
streaming: Relax streaming::make_streamig_consumer() view builder arg
streaming: Keep non-sharded view_builder dependency reference
streaming: Remove view_builder.local_is_initialized() checks
repair: Keep non-sharded view_builder dependency reference
repair: Remove view_builder.local_is_initialized() checks
main: Start sharded<view_builder> earlier
test/cql_env: Move stream manager start lower
Replace legacy shell test operator (-o) with more portable OR (||) syntax.
Fix fragile file handling in find loop by using while read loop instead.
Warnings fixed:
- SC2166: Replace [ p -o q ] with [ p ] || [ q ]
- SC2044: Replace for loop over find with while read loop
While no issues were observed with the current code, these changes improve
robustness and portability across different shell environments.
also, set the pipefail option, so that we can catch the unexpected
failure of `find` command call.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22385
This helper now fully duplicates the validate_table() one, so it
can be removed. Two callers are updated respectively.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places that validate_table() and then call
database::find_column_family(ks, cf) which goes and repeats the search
done by validate_table() before that.
To remove the unneeded work, re-use the table_id found by
validate_table() helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper calls database::find_column_family() and ignores the result.
The intention of this is just to check if the c.f. in question exists.
The find_column_family() in turn calls find_uuid() and then finds the
c.f. object using the uuid found. The latter search is not supposed to
fail, if it does, the on_internal_error() is called.
Said that, replacing find_column_family() with find_uuid() is
idempotent. And returning the found table_id will be used by next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to be in-sync with another get_uuid() helper from API. This, in
turn, is to ease the unification of those two, because they are
effectively identical (see next patches)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is redundant with reader_permit::impl::_ttl_timer. Use the latter for
TTL of inactive reads too. The usage of the two exclude each other, at
any point in time, either one or the other is used, so no reason to keep
both.
Closesscylladb/scylladb#22863
Demote do-nothing decisions to debug level, but keep them at info
if we did decide to do nothing (such as migrate a tablet). Information
about more major events (like split/merge) is kept at info level.
Once log line that logs node information now also logs the datacenter,
which was previously supplied by a log line that is now debug-only.
Closesscylladb/scylladb#22783
Currently, the tablet repair scheduler repairs all replicas of a tablet. It does not support hosts or DCs selection. It should be enough for most cases. However, users might still want to limit the repair to certain hosts or DCs in production. https://github.com/scylladb/scylladb/pull/21985 added the preparation work to add the config options for the selection. This patch adds the hosts or DCs selection support.
Fixes https://github.com/scylladb/scylladb/issues/22417
New feature. No backport is needed.
Closesscylladb/scylladb#22621
* github.com:scylladb/scylladb:
test: add test to check dcs and hosts repair filter
test: add repair dc selection to test_tablet_metadata_persistence
repair: Introduce Host and DC filter support
docs: locator: update the docs and formatter of tablet_task_info
result_set_row is a heavyweight object containing multiple cell types:
regular columns, partition keys, and static values. To prevent expensive
accidental copies, delete the copy constructor and replace it with:
1. A move constructor for efficient vector reallocation
2. An explicit copy() method when copies are actually needed
This change reduces overhead in some non-hot paths by eliminating implicit
deep copies. Please note, previously, in `create_view_from_mutation()`,
we kept a copy of `result_set_row`, and then reused `table_rs` for
holding the mutation for `scylla_tables`. Because we don't copy
the `result_set_row` in this change, in order to avoid invalidating
the `row` after reusing `table_rs` in the outer scope, we define a
new `table_rs` shadowing the one in the out scope.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22741
The existing test measures latencies of object GET-s. That's nice (though incomplete), but we want to measure upload performance. Here it is.
refs: #22460Closesscylladb/scylladb#22480
* github.com:scylladb/scylladb:
test/perf/s3: Add --part-size-mb option for upload test
test/perf/s3: Add uploading test
test/perf/s3: Some renames not to be download-centric
test/perf/s3: Make object/file name configurable
test/perf/s3: Configure maximum number of sockets
test/perf/s3: Remove parallelizm
s3/client: Make http client connections limit configurable
Given two sets of equivalent types, return the set
intersection.
This is a generic function which adapts to the actual
input type.
A unit test is added.
Closesscylladb/scylladb#22763
this PR is propper(pythonic) chance of commit 288a47f815
Creating an own folder used to be needed for two reasons:
we want a separate test suite, with its own settings
we want to structure tests, e.g. tablets, raft, schema, gossip.
We've been creating many folders recently. However, test suite
infrastructure is expensive in test.py - each suite has its own
pool of servers, concurrency settings and so on.
Make it possible to structure tests without too many suites,
by supporting subfolders within a suite.
As an example, this PR move mv tests into a separate folder
custom test.py lookup also works.
tests can be run as:
1. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv/tablets/test_mv_tablets_empty_ip
2. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv/tablets
3. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv
Fixes https://github.com/scylladb/scylladb/issues/20570Closesscylladb/scylladb#22816
* github.com:scylladb/scylladb:
test.py: move mv tests into a separate folder
test.py: suport subfolders
Seems tox is not used anywhere, so there is no need to have it then.
Especially when it messes with pytest. In some cases it can change the
config dir in pytest run.
Closesscylladb/scylladb#22819
Table updates that try to enable stream (while changing or not the
StreamViewType) on a table that already has the stream enabled
will result in ValidationError.
Table updates that try to disable stream on a table that does not
have the stream enabled will result in ValidationError.
Add two tests to verify the above.
Mark the test for changing the existing stream's StreamViewType
not to xfail.
Fixesscylladb/scylladb#6939Closesscylladb/scylladb#22827
Switch to using schema_ptr wrapper when handling schema references in
scylla_read_stats function. The existing fallback for older versions
(where schema is already a raw pointer) remains preserved.
Fixes#18700
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#22726
Fixes#22688
If we set a dc rf to zero, the options map will still retain a dc=0 entry.
If this dc is decommissioned, any further alters of keyspace will fail,
because the union of new/old options will now contained an unknown keyword.
Change alter ks options processing to simply remove any dc with rf=0 on
alter, and treat this as an implicit dc=0 in nw-topo strategy.
This means we change the reallocate_tablets routine to not rely on
the strategy objects dc mapping, but the full replica topology info
for dc:s to consider for reallocation. Since we verify the input
on attribute processing, the amount of rf/tablets moved should still
be legal.
v2:
* Update docs as well.
v3:
* Simplify dc processing
* Reintroduce options empty check, but do early in ks_prop_defs
* Clean up unit test some
Closesscylladb/scylladb#22693
The test
topology_custom/test_alternator::test_localnodes_broadcast_rpc_address
sets up nodes with a silly "broadcast rpc address" and checks that
Alternator's "/localnodes" requests returns it correctly.
The problem is that although we don't use CQL in this test, the test
framework does open a CQL connection when the test starts, and closes
it when it ends. It turns out that when we set a silly "broadcast RPC
address", the driver tends to try to connect to it when shutting down,
I'm not even sure why. But the choice of the silly address was 1.2.3.4
is unfortunate, because this IP address is actually routable - and
the driver hangs until it times out (in practice, in a bit over two
minutes). This trivial patch changes 1.2.3.4 to 127.0.0.0 - and equally
silly address but one to which connections fail immediately.
Before this patch, the test often takes more than 2 minutes to finish
on my laptop, after this patch, it always finishes in 4-5 seconds.
Fixes#22744
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22746
The code currently assumes that a session has both sender and receiver
streams, but it is possible to have just one or the other.
Change the test to include this scenario and remove this assumption from
the code.
Fixes: #22770Closesscylladb/scylladb#22771
This PR addresses two related issues in our task system:
1. Prepares for asynchronous resource cleanup by converting release_resources() to a coroutine. This refactoring enables future improvements in how we handle resource cleanup.
2. Fixes a cross-shard resource cleanup issue in the SSTable loader where destruction of per-shard progress elements could trigger "shared_ptr accessed on non-owner cpu" errors in multi-shard environments. The fix uses coroutines to ensure resources are released on their owner shards.
Fixes#22759
---
this change addresses a regression introduced by d815d7013c, which is contained by 2025.1 and master branches. so it should be backported to 2025.1 branch.
Closesscylladb/scylladb#22791
* github.com:scylladb/scylladb:
sstable_loader: fix cross-shard resource cleanup in download_task_impl
tasks: make release_resources() a coroutine
This commit eliminates unused boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22857
In a rolling upgrade, nodes that weren't upgraded yet will not recognize
the new tablet_resize_finalization state, that serves both split and
merges, leading to a crash. To fix that, coordinator will pick the
old tablet_split_finalization state for serving split finalization,
until the cluster agrees on merge, so it can start using the new
generic state for resize finalization introduced in merge series.
Regression was introduced in e00798f.
Fixes#22840.
Reported-by: Tomasz Grabiec <tgrabiec@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#22845
The script fetch_scylla.py is used by the "--release" option of
test/cqlpy/run and test/alternator/run to fetch a given release of
Scylla. The release is fetched from S3, and the script assumed that the
user properly set up $HOME/.aws/config and $HOME/.aws/credentials
to determine the source of that download and the credentials to do this.
But this is unnecessary - Scylla's "downloads.scylladb.com" bucket
actually allows **anonymous** downloads, and this is what we should use.
After this patch, fetch_scylla.py (and the "--release" option of the
run scripts) work correctly even for a user that doesn't have $HOME/.aws
set up at all.
This fix is especially important to new developers, who might not even
have AWS credentials to put into these files.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22773
Two callers of it -- repair and stream-manager -- both have non-sharded
reference and can just use it as argument. The helper in question gets
sharded<> one by itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of the previous path -- view builder is started early
enough and construction of stream manager can happen with non-sharded
reference on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of the previous path -- view builder is started early
enough and construction of repair service can happen with non-sharded
reference on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The view_builder service is needed by repair service, but is started
after it. It's OK in a sense that repair service holds a sharded
reference on it and checks whether local_is_initialized() before using
it, which is not nice.
Fortunately, starting sharded view buidler can be done early enough,
because most of its dependencies would be already started by that time.
Two exceptions are -- view_update_generator and
system_distributed_keyspace. Both can be moved up too with the same
justification.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to keep it in-sync with main code, where stream manager is
started after storage_proxy's and query_processor's remotes. This
doesn't change nothing for now, but next patches will move other
services around main/cql_test_env and early start of stream manager in
cql_test_env will be problematic.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Renamed the aboved mentioned method to `increment_total_reclaimable_memory()`
as it doesn't directly reclaim memory anymore.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Move the sstable reclaim logic into `components_reclaim_reload_fiber()`
in preparation for the fix for #21947. This also simplifies the overall
reclaim/reload logic by preventing multiple fibers from attempting to
reclaim/reload component memory concurrently.
Also, update the existing test cases to adapt to this change.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Rename the `_sstable_deleted_event` condition variable to
`_components_memory_change_event` as it will be used by future patches
to signal changes in sstable component memory consumption, (i.e.)
during sstable create and delete, and also when the
`components_memory_reclaim_threshold` config value is changed.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
A future patch will move components reclaim logic into the current
`components_reloader_fiber()`, so to reflect its new purpose, rename it
to `components_reclaim_reload_fiber()`.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The current implementation reclaims memory from SSTables only when a new
SSTable is created. An upcoming patch will move this reclaim logic into
the existing component reloader fiber. To support this change, the
`maybe_reclaim_components()` method is updated to reclaim memory until
the total memory consumption falls below the configured threshold.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Introduce `get_components_memory_reclaim_threshold()`, which returns
the components' memory threshold based on the total available memory.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Extract the code from
`increment_total_reclaimable_memory_and_maybe_reclaim()` that reclaims
the components memory into `maybe_reclaim_components()`. The extracted
new method will be used by a following patch to handle reclaim within
the components reload fiber.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Extract the logic that reloads reclaimed components into memory in the
`components_reloader_fiber()` method into a separate method. This is in
preparation for moving the reclaim logic into the same fiber.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Test now uses default internal part size, but for performance
comparisons its good to make it configurable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test picks up a file and uploads it into the bucket, then prints the
time it took and uploading speed. For now it's enough, with existing S3
latencies more timing details can be obtained by turning on trace
logging on s3 logger.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now this test is all about reading objects. Rename some bits in it so
that they can be re-used by future uploading test as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the download test first creates a temporary object and then reads
data from it. It's good to have an option to download pre-existing file.
This option will also be used for uploading test (next patches)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add the --sockets NR option that limits the number of sockets the
underlying http client is configured to have.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test spawns several fibers that read the same file in parallel.
There's not much point in it, just makes the code harder to maintain.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to allow concurrent truncate table operations (for the time being,
only for a single table) we have to remove the limitation allowing only one
truncate table fiber per shard.
This change adds the ability to collect the active truncate fibers in
storage_proxy::remote into std::list<> instead of having just a single
truncate fiber. These fibers are waited for completion during
storage_proxy::remote::stop().
In the current scenario, topology_change_kind variable, was been handled using
_manage_topology_change_kind_from_group0 variable. This method was brittle
and had some bugs(e.g. for restart case, it led to a time gap between group0
server start and topology_change_kind being managed via group0)
Post _manage_topology_change_kind_from_group0 removal, careful management of
topology_change_kind variable was needed for maintaining correct
topology_change_kind in all scenarios. So this PR also performs a refactoring
to populate all init data to system tables even before group0 creation(via
raft_initialize_discovery_leader function). Now because raft_initialize_discovery_leader
happens before the group 0 creation, we write mutations directly to system
tables instead of a group 0 command. Hence, post group0 creation, the node
can read the correct values from system tables and correct values are
maintained throughout.
Added a new function initialize_done_topology_upgrade_state which takes
care of updating the correct upgrade state to system tables before starting
group0 server. This ensures that the node can read the correct values from
system tables and correct values are maintained throughout.
By moving raft_initialize_discovery_leader logic to happen before starting
group0 server, and not as group0 command post server start, we also get rid
of the potential problem of init group0 command not being the 1st command on
the server. Hence ensuring full integrity as expected by programmer.
Fixes: scylladb/scylladb#21114
Currently, the tablet repair scheduler repairs all replicas of a tablet.
It does not support hosts or DCs selection. It should be enough for most
cases. However, users might still want to limit the repair to certain
hosts or DCs in production. #21985 added the preparation work to add the
config options for the selection. This patch adds the hosts or DCs
selection support.
Fixes#22417
Previously, download_task_impl's destructor would destroy per-shard progress
elements on whatever shard the task was destroyed on. In multi-shard
environments, this caused "shared_ptr accessed on non-owner cpu" errors when
attempting to free memory allocated on a different shard.
Fix by:
- Convert progress_per_shard into a sharded service
- Stop the service on owner shards during cleanup using coroutines
- Add operator+= to stream_progress to leverage seastar's built-in adder
instead of a custom adder struct
Alternative approaches considered:
1. Using foreign_ptr: Rejected as it would require interface changes
that complicate stream delegation. foreign_ptr manages the underlying
pointee with another smart pointer but does not expose the smart
pointer instance in its APIs, making it impossible to use
shared_ptr<stream_progress> in the interface.
2. Using vector<stream_progress>: Rejected for similar interface
compatibility reasons.
This solution maintains the existing interfaces while ensuring proper
cross-shard cleanup.
Fixesscylladb/scylladb#22759
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Convert tasks::task_manager::task::impl::release_resources() to a coroutine
to prepare for upcoming changes that will implement asynchronous resource
release.
This is a preparatory refactoring that enables future coroutine-based
implementation of resource cleanup logic.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The timeout of 10 seconds is too small for CI.
I didn't mean to make it so short, it was an accident.
Fix that by changing the timeout to 10 minutes.
Fixesscylladb/scylladb#22832Closesscylladb/scylladb#22836
Token metadata API now depend on gossiper to do ip to host id mappings,
so initialized it after the gossiper is initialized and de-initialized
it before gossiper is stopped.
Fixes: scylladb/scylladb#22743Closesscylladb/scylladb#22760
We need to allow replacing nodetool from scylla-enterprise-tools < 2024.2,
just like we did for scylla-tools < 5.5.
This is required to make packages able to upgrade from 2024.1.
Fixes#22820Closesscylladb/scylladb#22821
Serializing raft::append_request for transmission requires approximately
the same amount of memory as its size. This means when the Raft
library replicates a log item to M servers, the log item is
effectively copied M times. To prevent excessive memory usage
and potential out-of-memory issues, we limit the total memory
consumption of in-flight raft::append_request messages.
Fixes [scylladb/scylladb#14411](https://github.com/scylladb/scylladb/issues/14411)
test_complex_null_values is currently flaky, causing many failures
in CI. The reason for the failures is unclear, and a fix might not
be simple, so because UDFs are experimental, for now let's skip
this test until the corresponding issue is fixed.
Refs scylladb/scylladb#22799Closesscylladb/scylladb#22818
Now that we support suite subfolders,
As an example, this commit move mv tests into a separate folder
custom test.py lookup also works.
tests can be run as:
1. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv/tablets/test_mv_tablets_empty_ip
2. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv/tablets
3. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv
Creating an own folder used to be needed for two reasons:
- we want a separate test suite, with its own settings
- we want to structure tests, e.g. tablets, raft, schema, gossip.
We've been creating many folders recently. However, test suite
infrastructure is expensive in test.py - each suite has its own
pool of servers, concurrency settings and so on.
Make it possible to structure tests without too many suites,
by supporting subfolders within a suite.
Fixes#20570
Said method passes down its `diff` input to `mutate_internal()`, after
some std::ranges massaging. Said massaging is destructive -- it moves
items from the diff. If the output range is iterated-over multiple
times, only the first time will see the actual output, further
iterations will get an empty range.
When trace-level logging is enabled, this is exactly what happens:
`mutate_internal()` iterates over the range multiple times, first to log
its content, then to pass it down the stack. This ends up resulting in
a range with moved-from elements being pased down and consequently write
handlers being created with nullopt mutations.
Make the range re-entrant by materializing it into a vector before
passing it to `mutate_internal()`.
Fixes: scylladb/scylladb#21907Fixes: scylladb/scylladb#21714Closesscylladb/scylladb#21910
When building Scylla with ThinLTO enabled (default with Clang), the linker
spawns threads equal to the number of CPU cores during linking. This high
parallelism can cause out-of-memory (OOM) issues in CI environments,
potentially freezing the build host or triggering the OOM killer.
In this change:
1. Rename `LINK_MEM_PER_JOB` to `Scylla_RAM_PER_LINK_JOB` and make it
user-configurable
2. Add `Scylla_PARALLEL_LINK_JOBS` option to directly control concurrent
link jobs (useful for hosts with large RAM)
3. Increase the default value of `Scylla_PARALLEL_LINK_JOBS` to 16 GiB
when LTO is enabled
4. Default to 2 parallel link jobs when LTO is enabled if the calculated
number if less than 2 for faster build.
Notes:
- Host memory is shared across job pools, so pool separation alone doesn't help
- Ninja lacks per-job memory quota support
- Only affects link parallelism in LTO-enabled builds
See
https://clang.llvm.org/docs/ThinLTO.html#controlling-backend-parallelismFixesscylladb/scylladb#22275
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22383
Oversized materialized view and index names are rejected;
Materialized view names with invalid symbols are rejected.
fixes: #20755Closesscylladb/scylladb#21746
The Intel Optimizaton Manual states that branches with relative offsets
greater than 2GB suffer a penalty. They cite a 6% improvement when this
is avoided. Our code doesn't rely heavily on dynamically linked
libraries, so I don't expect a similar win, but it's still better to do
it than not.
Eliminate long branches by asking the dynamic linker to restrict itself
to the lower 4GB of the address space. I saw that it maps libraries
at 1GB+ addresses, so this satisfies the limitation.
Fix is from the Intel Optimization Manual as well.
This change was ported from ScyllaDB Enterprise.
Closesscylladb/scylladb#22498
This command exists but is not registered. There is a test for it, but
it happens to work only because scylla table is a prefix of scylla
tables (another command), so gdb invokes that other command instead.
Replace boost::remove_if() with the standard library's std::erase_if() or std::ranges::remove_if() to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#22788
* github.com:scylladb/scylladb:
service: migrate from boost::range::remove_if() to std::ranges::remove_if
sstable: migrate from boost::remove_if() to std::erase_if()
Replace boost::copy() with the standard library's std::ranges::copy()
to reduce external dependencies and simplify the codebase. This change
eliminates the requirement for boost::range and makes the implementation
more maintainable.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22789
Set true to wait for the repair to complete. Set false to skip waiting
for the repair to complete. When the option is not provided, it defaults
to false.
It is useful for management tool that wants the api to be async.
Fixes#22418Closesscylladb/scylladb#22436
prepare helpfully prints out the path where optimized clang is stored,
but a couple of typos mean it prints out an empty string. Fix that.
Closesscylladb/scylladb#22714
Developers are expected to run new cqlpy tests against Cassandra - to
verify that the new test itself is correct. Usually there is no need
to run the entire cqlpy test suite against Cassandra, but when users do
this, it isn't confidence-inspiring to see hundreds of tests failing.
In this patch I fix many but not all of these failures.
Refs #11690 (which will remain open until we fix all the failures on
Cassandra)
* Fixed the "compact_storage" fixture recently introduced to enable the
deprecated feature in Scylla for the tests. This fixture was broken on
Cassandra and caused all compact-storage related tests to fail
on Cassandra.
* Marked all tests in test_tombstone_limit.py as scylla_only - as they
check the Scylla-only query_tombstone_page_limit configuration option.
* Marked all tests in test_service_level_api.py as scylla_only - as they
check the Scylla-only service levels feature.
* Marked a test specific to the Scylla-only IncrementalCompactionStrategy
as scylla_only. Some tests mix STCS and ICS testing in one test - this
is a mistake and isn't fixed in this patch.
* Various tests in test_tablets.py forgot to use skip_without_tablets
to skip them on Cassandra or older Scylla that doesn't have the
tablets feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
x
Closesscylladb/scylladb#22561
Use host_id in a children list of a task in task manager to indicate
a node on which the child was created.
Move TASKS_CHILDREN_REQUEST to IDL. Send it by host_id.
Fixes: https://github.com/scylladb/scylladb/issues/22284.
Ip to host_id transition; backport isn't needed.
Closesscylladb/scylladb#22487
* github.com:scylladb/scylladb:
tasks: drop task_manager::config::broadcast_address as it's unused
tasks: replace ip with host_id in task_identity
api: task_manager: pass gossiper to api::set_task_manager
tasks: keep host_id in task_manager
tasks: move tasks_get_children to IDL
The series fixes a regression and demotes a barrier_and_drain logging error to a warning since this particular condition may happen during normal operation.
We want to backport both since one is a bug fix and another is trivial and reduces CI flakiness.
Closesscylladb/scylladb#22650
* https://github.com/scylladb/scylladb:
topology_coordinator: demote barrier_and_drain rpc failure to warning
topology_coordinator: read peers table only once during topology state application
In this series we implement the UpdateTable operation to add a GSI to an existing table, or remove a GSI from a table. As the individual commit messages will explained, this required changing how Alternator stores materialized view keys - instead of insisting that these key must be real columns (that is **not** the case when adding a GSI to an existing table), the materialized view can now take as its key any Alternator attribute serialized inside the ":attrs" map holding all non-key attributes. Fixes#11567.
We also fix the IndexStatus and Backfilling attributes returned by DescribeTable - as DynamoDB API users use this API to discover when a newly added GSI completed its "backfilling" (what we call "view building") stage. Fixes#11471.
This series should not be backported lightly - it's a new feature and required fairly large and intrusive changes that can introduce bugs to use cases that don't even use Alternator or its UpdateTable operations - every user of CQL materialized views or secondary indexes, as well as Alternator GSI or LSI, will use modified code. **It should be backported to 2025.1**, though - this version was actually branched long after this PR was sent, and it provides a feature that was promised for 2025.1.
Closesscylladb/scylladb#21989
* github.com:scylladb/scylladb:
alternator: fix view build on oversized GSI key attribute
mv: clean up do_delete_old_entry
test/alternator: unflake test for IndexStatus
test/alternator: work around unrelated bug causing test flakiness
docs/alternator: adding a GSI is no longer an unimplemented feature
test/alternator: remove xfail from all tests for issue 11567
alternator: overhaul implementation of GSIs and support UpdateTable
mv: support regular_column_transformation key columns in view
alternator: add new materialized-view computed column for item in map
build: in cmake build, schema needs alternator
build: build tests with Alternator
alternator: add function serialized_value_if_type()
mv: introduce regular_column_transformation, a new type of computed column
alternator: add IndexStatus/Backfilling in DescribeTable
alternator: add "LimitExceededException" error type
docs/alternator: document two more unimplemented Alternator features
Replace boost::range::remove_if() with the standard library's
std::ranges::remove_if() to reduce external dependencies and simplify
the codebase. This change eliminates the requirement for boost::range
and makes the implementation more maintainable.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Replace boost::remove_if() with the standard library's std::erase_if()
to reduce external dependencies and simplify the codebase. This change
eliminates the requirement for boost::range and makes the implementation
more maintainable.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We are supposed to be loading the most recent RPC compression dictionary
on startup, but we forgot to port the relevant piece of logic during
the source-available port. This causes a restarted node not to use the
dictionary for RPC compression until the next dictionary update.
Fix that.
Fixesscylladb/scylladb#22738
This is more of a bugfix than an improvement, so it should be backported to 2025.1.
Closesscylladb/scylladb#22739
* github.com:scylladb/scylladb:
test_rpc_compression.py: test the dictionaries are loaded on startup
raft/group0_state_machine: load current RPC compression dict on startup
Before these changes, shutting down a node could be prolonged because of
mapreduce_service. `mapreduce_service::stop()` uninitializes messaging
service, which includes waiting for all ongoing RPC handlers. We already
had a mechanism for cancelling local mapreduce tasks, but we were missing
one for cancelling external queries.
In this commit, we modify the signature of the request so it supports
cancelling via an abort source. We also provide a reproducer test
for the problem.
Fixesscylladb/scylladb#22337Closesscylladb/scylladb#22651
The following is observed in pytest:
1) node1, stream master, tried to pull data from node3
2) node3, stream follower, found node1 restarted
3) node3 killed the rpc stream
4) node1 did not get the stream session failure message from node3. This
failure message was supposed to kill the stream plan on node1. That's the
reason node1 failed the stream session much later at "2024-08-19 21:07:45,539".
Note, node3 failed the stream on its side, so it should have sent the stream
session failure message.
```
$ cat node1.log |grep f890bea0-5e68-11ef-99ae-e5bca04385fc
INFO 2024-08-19 20:24:01,162 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Executing streaming plan for Tablet migration-ks-index-0 with peers={127.0.34.3}, master
ERROR 2024-08-19 20:24:01,190 [shard 1:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=127.0.34.3: seastar::nested_exception: seastar::rpc::stream_closed (rpc stream was closed by peer) (while cleaning up after seastar::rpc::stream_closed (rpc stream was closed by peer))
WARN 2024-08-19 21:07:45,539 [shard 0:main] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Streaming plan for Tablet migration-ks-index-0 failed, peers={127.0.34.3}, tx=0 KiB, 0.00 KiB/s, rx=484 KiB, 0.18 KiB/s
$ cat node3.log |grep f890bea0-5e68-11ef-99ae-e5bca04385fc
INFO 2024-08-19 20:24:01,163 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Executing streaming plan for Tablet migration-ks-index-0 with peers=127.0.34.1, slave
INFO 2024-08-19 20:24:01,164 [shard 1:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Start sending ks=ks, cf=cf, estimated_partitions=2560, with new rpc streaming
WARN 2024-08-19 20:24:01,187 [shard 0: gms] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Streaming plan for Tablet migration-ks-index-0 failed, peers={127.0.34.1}, tx=633 KiB, 26506.81 KiB/s, rx=0 KiB, 0.00 KiB/s
WARN 2024-08-19 20:24:01,188 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] stream_transfer_task: Fail to send to 127.0.34.1:0: seastar::rpc::stream_closed (rpc stream was closed by peer)
WARN 2024-08-19 20:24:01,189 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Failed to send: seastar::rpc::stream_closed (rpc stream was closed by peer)
WARN 2024-08-19 20:24:01,189 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Streaming error occurred, peer=127.0.34.1
```
To be safe in case the stream fail message is not received, node1 could fail
the stream plan as soon as the rpc stream is aborted in the
stream_mutation_fragments handler.
Fixes#20227Closesscylladb/scylladb#21960
bash error handling and reporting is atrocious. Without -e it will
just ignore errors. With -e it will stop on errors, but not report
where the error happened (apart from exiting itself with an error code).
Improve that with the `trap ERR` command. Note that this won't be invoked
on intentional error exit with `exit 1`.
We apply this on every bash script that contains -e or that it appears
trivial to set it in. Non-trivial scripts without -e are left unmodified,
since they might intentionally invoke failing scripts.
Closesscylladb/scylladb#22747
The partition key had been renamed and its type changed some time ago,
but the doc wasn't updated. Fix it.
refs: #20998
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#22683
This PR converts boost load balancer tests in preparation for load balancer changes
which add per-table tablet hints. After those changes, load balancer consults with the replication
strategy in the database, so we need to create proper schema in the
database. To do that, we need proper topology for replication
strategies which use RF > 1, otherwise keyspace creation will fail.
Topology is created in tests via group0 commands, which is abstracted by
the new `topology_builder` class.
Tests cannot modify token_metadata only in memory now as it needs to be
consistent with the schema and on-disk metadata. That's why modifications to
tablet metadata are now made under group0 guard and save back metadata to disk.
Closesscylladb/scylladb#22648
* github.com:scylladb/scylladb:
test: tablets: Drop keyspace after do_test_load_balancing_merge_colocation() scenario
tests: tablets: Set initial tablets to 1 to exit growing mode
test: tablets_test: Create proper schema in load balancer tests
test: lib: Introduce topology_builder
test: cql_test_env: Expose topology_state_machine
topology_state_machine: Introduce lock transition
This patch removes expansion of "SELECT *" in DESC MATERIALIZED VIEW.
Instead of explicitly printing each column, DESC command will now just
use SELECT *, if view was created with it. Also, adds a correspodning test.
Fixes#21154Closesscylladb/scylladb#21962
Replace boost::algorithm::all_of_equal() to std::ranges::all_of()
In order to reduce the header dependency to boost ranges library, let's
use the utility from the standard library when appropriate.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22730
In few test cases of test_view_build_status we create a view, wait for
it and then query the view_build_status table and expect it to have all
rows for each node and view.
But it may fail because it could happen that the wait_for_view query and
the following queries are done on different nodes, and some of the nodes
didn't apply all the table updates yet, so they have missing rows.
To fix it, we change the assert to work in the eventual consistency
sense, retrying until the number of rows is as expectd.
Fixesscylladb/scylladb#22644Closesscylladb/scylladb#22654
The "make-pr-ready-for-review" workflow was failing with an "Input
required and not supplied: token" error. This was due to GitHub Actions
security restrictions preventing access to the token when the workflow
is triggered in a fork:
```
Error: Input required and not supplied: token
```
This commit addresses the issue by:
- Running the workflow in the base repository instead of the fork. This
grants the workflow access to the required token with write permissions.
- Simplifying the workflow by using a job-level `if` condition to
controlexecution, as recommended in the GitHub Actions documentation
(https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/using-conditions-to-control-job-execution).
This is cleaner than conditional steps.
- Removing the repository checkout step, as the source code is not required for this workflow.
This change resolves the token error and ensures the
"make-pr-ready-for-review" workflow functions correctly.
Fixesscylladb/scylladb#22765
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22766
In c5668d99, a new source file row_cache.cc was added to the `db` target,
but with an extraneous trailing comma. In CMake's target_sources(),
source files should be space-separated - any comma is interpreted as
part of the filename, causing build failures like:
```
CMake Error at db/CMakeLists.txt:2 (target_sources):
Cannot find source file:
row_cache.cc,
```
Fix the issue by removing the trailing comma.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22754
Add the possibility to run boost and unit tests with pytest
test.py should follow the next paradigm - the ability to run all test cases sequentially by ONE pytest command.
With this paradigm, to have the better performance, we can split this 1 command into 2,3,4,5,100,200... whatever we want
It's a new functionality that does not touch test.py way of executing the boost and unit tests.
It supports the main features of test.py way of execution: automatic discovery of modes, repeats.
There is an additional requirement to execute tests in parallel: pytest-xdist. To install it, execute `pip install pytest-xdist`
To run test with pytest execute `pytest test/boost`. To execute only one file, provide the path filename `pytest test/boost/aggregate_fcts_test.cc` since it's a normal path, autocompletion will work on the terminal. To provide a specific mode, use the next parameter `--mode dev`, if parameter will not be provided pytest will try to use `ninja mode_list` to find out the compiled modes.
Parallel execution controlled by pyest-xdist and the parameter `-n 12`.
The useful command to discover the tests in the file or directory is `pytest --collect-only -q --mode dev test/boost/aggregate_fcts_test.cc`. That will return all test functions in the file. To execute only one function from the test, you can invoke the output from the previous command, but suffix for mode should be skipped, for example output will be `test/boost/aggregate_fcts_test.cc::test_aggregate_avg.dev`, so to execute this specific test function, please use the next command `pytest --mode dev test/boost/aggregate_fcts_test.cc::test_aggregate_avg`
There is a parameter `--repeat` that used to repeat the test case several times in the same way as test.py did.
It's not possible to run both boost and unit tests directories with one command, so we need to provide explicitly which directory should be executed. Like this `pytest --mode dev test/unit` or `pytest --mode dev test/boost`
Fixes: https://github.com/scylladb/qa-tasks/issues/1775Closesscylladb/scylladb#21108
* github.com:scylladb/scylladb:
test.py: Add possibility to run ldap tests from pytest
test.py: Add the possibility to run unit tests from pytest
test.py: Add the possibility to run boost test from pytest
test.py: Add discovery for C++ tests for pytest
test.py: Modify s3 server mock
test.py: Add method to get environment variables from MinIO wrapper
test.py: Move get configured modes to common lib
When upgrading for example from `2024.1` to `2025.1` the package name is
not identical casuing the upgrade command to fail:
```
Command: 'sudo DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade scylla -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold"'
Exit code: 100
Stdout:
Selecting previously unselected package scylla.
Preparing to unpack .../6-scylla_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb ...
Unpacking scylla (2025.1.0~dev-0.20250118.1ef2d9d07692-1) ...
Errors were encountered while processing:
/tmp/apt-dpkg-install-JbOMav/0-scylla-conf_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/1-scylla-python3_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/2-scylla-server_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/3-scylla-kernel-conf_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/4-scylla-node-exporter_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/5-scylla-cqlsh_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
Stderr:
E: Sub-process /usr/bin/dpkg returned an error code (1)
```
Adding `Obsoletes` (for rpm) and `Replaces` (for deb)
Fixes: https://github.com/scylladb/scylladb/issues/22420Closesscylladb/scylladb#22457
* tools/python3 8415caf4...3e0b8932 (2):
> reloc: collect package files correctly if the package has an optional dependency
> dist: support smooth upgrade from enterprise to source availalbe
Closesscylladb/scylladb#22517
This series extends the table schema with per-table tablet options.
The options are used as hints for initial tablet allocation on table creation and later for resize (split or merge) decisions,
when the table size changes.
* New feature, no backport required
Closesscylladb/scylladb#22090
* github.com:scylladb/scylladb:
tablets: resize_decision: get rid of initial_decision
tablet_allocator: consider tablet options for resize decision
tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member
network_topology_strategy: allocate_tablets_for_new_table: consider tablet options
network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner
network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0
cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0
cql3/create_keyspace_statement: add deprecation warning for initial tablets
test: cqlpy: test_tablets: add tests for per-table tablet options
schema: add per-table tablet options
feature_service: add TABLET_OPTIONS cluster schema feature
`set_notify_handler()` is called after a querier was inserted into the querier cache. It has two purposes: set a callback for eviction and set a TTL for the cache entry. This latter was not disabling the pre-existing timeout of the permit (if any) and this would lead to premature eviction of the cache entry if the timeout was shorter than TTL (which his typical).
Disable the timeout before setting the TTL to prevent premature eviction.
Fixes: https://github.com/scylladb/scylladb/issues/22629
Backport required to all active releases, they are all affected.
Closesscylladb/scylladb#22701
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: set_notify_handler(): disable timeout
reader_permit: mark check_abort() as const
Add posibility to run ldap tests with pytest.
LDAP server will be created for each worker if xdist will be used.
For one thread one LDAP server will be used for all tests.
Add the possibility to run boost test from pytest.
Boost facade based on code from https://github.com/pytest-dev/pytest-cpp, but enhanced and rewritten to suite better.
Code based on https://github.com/pytest-dev/pytest-cpp. Updated, customized, enhanced to suit current needs.
Modify generate report to not modify the names, since it will break
xdist way of working. Instead modification will be done in post collect
but before executing the tests.
Add the possibility to return environment as a dict to use it later it subprocess created by xdist, without starting another s3 mock server for each thread.
Add method to retrieve MinIO server wrapper environment variables for
later processing.
This change will allow to sharing connection information with other
processes and allow reusing the server across multiple tests.
This scenario is invoked in a loop in the
test_load_balancing_merge_colocation_with_random_load test case, which
will cause accumulation of tablet maps making each reload slower in
subsequent iterations.
It wasn't a problem before because we overwritten tablet_metadata in
each iteration to contain only tablets for the current table, but now
we need to keep it consistent with the schema and don't do that.
After tablet hints, there is no notion of leaving growing mode and
tablet count is sustained continuously by initial tablet option, so we
need to lower it for merge to happen.
This is in preparation for load balancer changes needed to respect
per-table tablet hints and respecting per-shard tablet count
goal. After those changes, load balancer consults with the replication
strategy in the database, so we need to create proper schema in the
database. To do that, we need proper topology for replication
strategies which use RF > 1, otherwise keyspace creation will fail.
Will be used by load balancer tests which need more than a single-node
topology, and which want to create proper schema in the database which
depends on that topology, in particular creating keyspaces with
replication factor > 1.
We need to do that because load balancer will use replication strategy
from the database as part of plan making.
Will be used in load balancer tests to prevent concurrent topology
operations, in particular background load balancing.
load balancer will be invoked explicitly by the test. Disabling load
balancer in topology is not a solution, because we want the explicit
call to perform the load balancing.
This patch series contains improvements to our GitHub license header check workflow.
The first patch grants necessary write permissions to the workflow, allowing it to comment directly on pull requests when license header issues are found. This addresses a permissions-related error that previously prevented the workflow from creating comments.
The second patch optimizes the workflow by skipping the license check step when no relevant files have been modified in the pull request. This prevents unnecessary workflow failures that occurred when the check was run without any files to analyze.
Together, these changes make the license header checking process more robust and efficient. The workflow now properly communicates findings through PR comments and avoids running unnecessary checks.
---
no need to backport, as the workflow updated by this change only exists in master.
Closesscylladb/scylladb#22736
* github.com:scylladb/scylladb:
.github: grant write permissions for PR comments in license check workflow
.github: skip license check when no relevant files changed
It's possible to modify 'memtable_flush_period_in_ms' option only and as
single option, not with any other options together
Refs #20999Fixes#21223Closesscylladb/scylladb#22536
Grant write permissions to the check-license-header workflow to enable
commenting on pull requests. This fixes the "Resource not accessible by
integration" HTTP error that occurred when the workflow attempted to
create comments.
The permission is required according to GitHub's API documentation for
creating issue comments.
see also https://docs.github.com/en/rest/issues/comments?apiVersion=2022-11-28#create-an-issue-comment
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Skip the license header check step in `check-license-header.yaml` workflow
when no files with configured extensions were changed in the pull request.
Previously, the workflow would fail in this case since the --files
argument requires at least one file path:
```
check-license.py: error: argument --files: expected at least one argument
```
Add `if` condition to only run the check when steps.changed-files.outputs.files
is not empty.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Until now this action checked if we have a `backport/none` or `backport/x.y` label only, since we moved to the source available and the releases like 2025.1 don't match this regex this action keeps failing
Closesscylladb/scylladb#22734
set_notify_handler() is called after a querier was inserted into the
querier cache. It has two purposes: set a callback for eviction and set
a TTL for the cache entry. This latter was not disabling the
pre-existing timeout of the permit (if any) and this would lead to
premature eviction of the cache entry if the timeout was shorter than
TTL (which his typical).
Disable the timeout before setting the TTL to prevent premature
eviction.
Fixes: #scylladb/scylladb#22629
This patch addresses an issue where the buffer offset becomes incorrect when a request is retried. The new request uses an offset that has already been advanced, causing misalignment. This fix ensures the buffer offset is correctly reset, preventing such errors.
Closesscylladb/scylladb#22729
Before this change, it was ensured that a default superuser is created
before serving CQL. However, the mechanism didn't wait for default
password initialization, so effectively, for a short period, customer
couldn't authenticate as the superuser properily. The purpose of this
change is to improve the superuser initialization mechanism to wait for
superuser default password, just as for the superuser creation.
This change:
- Introduce authenticator::ensure_superuser_is_created() to allow
waiting for complete initialization of super user authentication
- Implement ensure_superuser_is_created in password_authenticator, so
waiting for superuser password initialization is possible
- Implement ensure_superuser_is_create in transitional_authenticator,
so the implementation from password_authenticator is used
- Implement no-op ensure_superuser_is_create for other authenticators
- Extend service::ensure_superuser_is_created to wait for superuser
initialization in authenticator, just as it was implemented earlier
for role_manager
- Add injected error (sleep) in password_authenticator::start to
reproduce a case of delayed password creation
- Implement test_delayed_deafult_password to verify the correctness of the fix
- Ensure superuser is created in single_node_cql_env::run_in_thread to
make single_node_cql more similar to scylla_main in main.cc
Fixesscylladb/scylladb#20566
Backport not needed - a minor bugfix
Closesscylladb/scylladb#22532
* github.com:scylladb/scylladb:
test: implement test_auth_password_ensured
test: implement connect_driver argument in ManagerClient::server_add
auth: ensure default superuser password is set before serving CQL
auth: added password_authenticator_start_pause injected error
We are supposed to be loading the most recent RPC compression dictionary
on startup, but we forgot to port the relevant piece of logic during
the source-available port.
This pull request is an implementation of vector data type similar to one used by Apache Cassandra.
The patch contains:
- implementation of vector_type_impl class
- necessary functionalities similar to other data types
- support for serialization and deserialization of vectors
- support for Lua and JSON format
- valid CQL syntax for `vector<>` type
- `type_parser` support for vectors
- expression adjustments such as:
- add `collection_constructor::style_type::vector`
- rename `collection_constructor::style_type::list` to `collection_constructor::style_type::list_or_vector`
- vector type encoding (for drivers)
- unit tests
- cassandra compatibility tests
- necessary documentation
Co-authored-by: @janpiotrlakomy
Fixes https://github.com/scylladb/scylladb/issues/19455Closesscylladb/scylladb#22488
* github.com:scylladb/scylladb:
docs: add vector type documentation
cassandra_tests: translate tests covering the vector type
type_codec: add vector type encoding
boost/expr_test: add vector expression tests
expression: adjust collection constructor list style
expression: add vector style type
test/boost: add vector type cql_env boost tests
test/boost: add vector type_parser tests
type_parser: support vector type
cql3: add vector type syntax
types: implement vector_type_impl
Do not merge tablets if that would drop the tablet_count
below the minimum provided by hints.
Split tablets if the current tablet_count is less than
the minimum tablet count calculated using the table's tablet options.
TODO: override min_tablet_count if the tablet count per shard
is greater than the maximum allowed. In this case
the tables tablet counts should be scaled down proportionally.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In column_family.cc and storage_service.cc there exist a bunch of helpers that parse and/or validate ks/cf names, and different endpoints use different combinations of those, duplicating the functionality of each other and generating some mess. This PR cleans the endpoints from column_family.cc that parse and validate fully qualified table name (the '$ks:$cf' string).
A visible "improvement" is that `validate_table()` helper usage in the api/ directory is narrowed down to storage_service.cc file only (with the intent to remove that helper completely), and the aforementioned `for_tables_on_all_shards()` helper becomes shorter and tiny bit faster, because it doesn't perform some re-lookups of tables, that had been performed by validation sanity checks before it.
There's more to be done in those helpers, this PR wraps only one part of this mess.
Below is the list of endpoints this PR affects and the tests that validate the changes:
|endpoint|test|
|-|-|
|column_family/autocompaction|rest_api/test_column_family::test_column_family_auto_compaction_table|
|column_family/tombstone_gc|rest_api/test_column_family::test_column_family_tombstone_gc_api|
|column_family/compaction_strategy|rest_api/test_column_family/test_column_family_compaction_strategy|
|compaction_manager/stop_keyspace_compaction/|rest_api/test_compaction_manager::{test_compaction_manager_stop_keyspace_compaction,test_compaction_manager_stop_keyspace_compaction_tables}|
Closesscylladb/scylladb#21533
* github.com:scylladb/scylladb:
api: Hide parse_tables() helper
api: Use parse_table_infos() in stop_keyspace_compaction handler
api: Re-use parse_table_info() in column_family API
api: Make get_uuid() return table_info (and rename)
api: Remove keyspace argument from for_table_on_all_shards()
api: Switch for_table_on_all_shards() to use table_info-s
api: Hide validate_table() helper
api: Tables vector is never empty now in for_table_on_all_shards()
api: Move vectors of tables, not copy
api: Add table validation to set_compaction_strategy_class endpoint
api: Use get_uuid() to validate_table() in column family API
api: Use parse_table_infos() in column family API
these unused includes were identified by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
also, took this opportunity to remove an unused namespace alias. and
add an include which is used actually. please note,
`std::ranges::pop_heap()` and friends are actually provided by
`<algorithm>` not `<ranges>`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22716
Before fix of scylladb#20566, CQL was served irrespectively of default
superuser password creation, which led to an incorrect product behavior
and sporadic test failures. This test verifies race condition of serving
CQL and creating default superuser password. Injected failure is used to
ensure CQL use is attempted before default superuser password creation,
however, the attempt is expected to fail because scylladb#20566 is
fixed. Following that, the injected error is notified, so CQL driver can
be started correctly. Finally, CREATE USER query is executed to confirm
successful superuser authentication.
This change:
- Implement test_auth_password_ensured.py
The test starts a server without expecting CQL serving, because
expected_server_up_state=ServerUpState.HOST_ID_QUERIED and
connect_driver=False. Error password_authenticator_start_pause is
injected to block superuser password setup during server startup.
Next, the test waits for a log to confirm that the code implementing
injected error is reached. When the server startup procedure is
unfinished, some operations might not complete on a first try, so
waiting for driver connection is wrapped in repeat_if_host_unavailable.
This commit introduces connect_driver argument in
ManagerClient::server_add. The argument allow skipping CQL driver
initialization part during server start. Starting a server without
the driver is necessary to implement some test scenarios related
to system initialization.
After stopping a server, ManagerClient::server_start can be used to
start the server again, so connect_driver argument is also added here to
allow preventing connecting the driver after a server restart.
This change:
- Implement connect_driver argument in ManagerClient::server_add
- Implement connect_driver argument in ManagerClient::server_start
Before this change, it was ensured that a default superuser is created
before serving CQL. However, the mechanism didn't wait for default
password initialization, so effectively, for a short period, customer
couldn't authenticate as the superuser properily. The purpose of this
change is to improve the superuser initialization mechanism to wait for
superuser default password, just as for the superuser creation.
This change:
- Introduce authenticator::ensure_superuser_is_created() to allow
waiting for complete initialization of super user authentication
- Implement ensure_superuser_is_created in password_authenticator, so
waiting for superuser password initialization is possible
- Implement ensure_superuser_is_create in transitional_authenticator,
so the implementation from password_authenticator is used
- Implement no-op ensure_superuser_is_create for other authenticators
- Modify service::ensure_superuser_is_created to wait for superuser
initialization in authenticator, just as it was implemented earlier
for role_manager
Fixesscylladb/scylladb#20566
This change:
- Implement password_authenticator_start_pause injected error to allow
deterministic blocking of default superuser password creation
This change facilitates manual testing of system behavior when default
superuser password is being initialized. Moreover, this mechanism will
be used in next commits to implement a test to verify a fix for
erroneous CQL serving before default superuser password creation.
this workflow checks the first 10 lines for
"LicenseRef-ScyllaDB-Source-Available-1.0" in newly introduced files
when a new pull request is created against "master" or "next".
if "LicenseRef-ScyllaDB-Source-Available-1.0" is not found, the
workflow fails. for the sake of simplicity, instead of parsing the
header for SPDX License ID, we just check to see if the
"LicenseRef-ScyllaDB-Source-Available-1.0" is included.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22065
Before this patch, the regular_column_transformation constructor, which
we used in Alternator GSIs to generates a view key from a regular-column
cell, accepted a cell of any size. As a reviewer (Avi) noticed, very
long cells are possible, well beyond what Scylla allows for keys (64KB),
and because regular_column_transformation stores such values in a
contiguous "bytes" object it can cause stalls.
But allowing oversized attributes creates an even more accute problem:
While view building (backfilling in DynamoDB jargon), if we encounter
an oversized (>64KB) key, the view building step will fail and the
entire view building will hang forever.
This patch fixes both problems by adding to regular_column_transformation's
constructor the check that if the cell is 64KB or larger, an empty value
is returned for the key. This causes the backfilling to silently skip
this item, which is what we expect to happen (backfilling cannot do
anything to fix or reject the pre-existing items in the best table).
A test test_gsi_updatetable.py::test_gsi_backfill_oversized_key is
introduced to reproduce this problem and its fix. The test adds a 65KB
attribute to a base table, and then adds GSIs to this table with this
attribute as its partition key or its sort key. Before this patch, the
backfilling process for the new GSIs hangs, and never completes.
After this patch, the backfilling completes and as expected contains
other base-table items but not the item with the oversized attribute.
The new test also passes on DynamoDB.
However, while implementing this fix I realized that issue #10347 also
exists for GSIs. Issue #10347 is about the fact that DynamoDB limits
partition key and sort key attributes to 2048 and 1024 bytes,
respectively. In the fix described above we only handled the accute case
of lengths above 64 KB, but we should actually skip items whose GSI
keys are over 2048 or 1024 bytes - not 64KB. This extra checking is
not handled in this patch, and is part of a wider existing issue:
Refs #10347
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The function do_delete_old_entry() had an if() which was supposedly for
the case of collection column indexing, and which our previous patch
that improved this function to support caller-specified deletion_ts
left behind.
As a reviewer noticed, the new tombstone-setting code was in an "else"
of that existing if(), and it wasn't clear what happens if we get to that
else in the collection column indexing. So I reviewed the code and added
breakpoints and realized that in fact, do_delete_old_entry() is never
called for the collection-indexing case, which has its own
update_entry_for_computed_column() which view_updates::generate_update()
calls instead of the do_delete_old_entry() function and its friends.
So it appears that do_delete_old_entry() doesn't need that special
case at all, which simplifies it.
We should eventually simplify this code further. In particular, the
function generate_update() already knows the key of the rows it
adds or deletes so do_delete_old_entry() and its friends don't need
to call get_view_rows() to get it again. But these simplifications
and other will need to come in a later patch series, this one is
already long enough :-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test for IndexStatus verifies that on a newly created table and GSI,
the IndexStatus is "ACTIVE". However, in Alternator, this doesn't strictly
need to happen *immediately* - view building, even for an empty table -
can take a short while in debug mode. This make the test test
test_gsi_describe_indexstatus flaky in debug mode.
The fix is to wait for the GSI to become active with wait_for_gsi()
before checking it is active. This is sort of silly and redundant,
but the important point that if the IndexStatus is incorrect this test
will fail, it doesn't really matter whether the wait_for_gsi() or
the DescribeTable assertion is what fails.
Now that wait_for_gsi() is used in two test files, this patch moves it
(and its friend, wait_for_gsi_gone()) to util.py.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The alternator test test_gsi_updatetable.py::test_gsi_delete_with_lsi
Creates a GSI together with a table, and then deletes it. We have a
bug unrelated to the purpose of this test - #9059 - that causes view
building to sometimes crash Scylla if the view is deleted while the
view build is starting. We see specifically in debug builds that even
view building of an *empty* table might not finish before the test
deletes the view - so this bug happens.
Work around that bug by waiting for the GSI to build after creating
the table with the GSI. This shouldn't be necessary (in DynamoDB,
a GSI created with the table always begins ready with the table),
but doesn't hurt either.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patches implemented issue #11567 - adding a GSI to a
pre-existing table. So we can finally remove the mention of this
feature as an "unimplemented feature" in docs/alternator/compatibility.md.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patches fully implemented issue 11567 - supporting
UpdateTable to add or delet a GSI on an existing Alternator table.
All 14 tests that were marked xfail because of this issue now pass,
so this patch removes their xfail. There are no more xfailing tests
referring to this issue.
These 14 tests, most of them in test/alternator/test_gsi_updatetable.py,
cover all aspects of this feature, including adding a GSI, deleting a
GSI, interactions between GSI and LSI, RBAC when adding or deleting a GSI,
data type limitation on an attribute that becomes a GSI key or stops
being one, GSI backfill, DescribeTable and backfill, various error
conditions, and more.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The main goal of this patch is to fully support UpdateTable's ability
to add a GSI to an existing table, and delete a GSI from an existing
table. But to achieve this, this patch first needs to overhaul how GSIs
are implemented:
Remember that in Alternator's data model, key attributes in a table
are stored as real CQL columns (with a known type), but all other
attributes of an item are stored in one map called ":attrs".
* Before this patch, the GSI's key columns were made into real columns
in the table's schema, and the materialized view used that column as
the view's key.
* After this patch, the GSI's key columns usually (when they are not
the base table's keys, and not any LSI's key) are left in the ":attrs"
map, just like any other non-key column. We use a new type of computed
column (added in the previous patch) to extract the desired element from
this map.
This overhaul of the GSI implementation doesn't change anything in the
functionality of GSIs (and the Alternator test suite tries very hard to
ensure that), but finally allows us to add a GSI to an already-existing
table. This is now possible because the GSI will be able to pick up
existing data from inside the ":attrs" map where it is stored, instead
of requiring the data in the map to be moved to a stand-alone column as
the previous implementation needed.
So this patch also finally implements the UpdateTable operations
(Create and Delete) to add or delete a GSI on an existing table,
as this is now fairly straightfoward. For the process of "backfilling"
the existing data into the new GSI we don't need to do anything - this
is just the materialized-view "view building" process that already
exists.
Fixes#11567.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In an earlier patch, we introduced regular_column_transformation,
a new type of computed column that does a computation on a cell in
regular column in the base and returns a potentially transformed cell
(value or deletion, timestamp and ttl). In *this* patch, we wire the
materialized view code to support this new kind of computed column that
is usable as a materialized-view key column. This new type of computed
column is not yet used in this patch - this will come in the next
patch, where we will use it for Alternator GSIs.
Before this patch, the logic of deciding when the view update needs
to create a new row or delete a new one, and which timestamp and ttl
to give to the new row, could depend on one (or two - in Alternator)
cells read from base-table regular columns. In this patch, this logic
is rewritten - the notion of "base table regular columns" is generalized
to the notion of "updatable view key columns" - these are view key
columns that an update may change - because they really are base regular
columns, or a computed function of one (regular_column_transformation).
In some sense, the new code is easier to understand - there is no longer
a separate "compute_row_marker()" function, rather the top-level
generate_update() is now in charge of finding the "updatable view key
columns" and calculate the row marker (timestamp and ttl) as part
of deciding what needs to be done.
But unfortunately the code still has separate code paths for "collection
secondary indexing", and also for old-style column_computation (basically,
only token_column_computation). Perhaps in the future this can be further
simplified.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a new computed column class for materialized views,
extract_from_attrs_column_computation
which is Alternator-specific and knows how to extract a value (of a
known type) from an attribute stored in Alternator's map-of-all-nonkey-
attributes ":attrs".
We'll use this new computed column in the next patch to reimplement GSI.
The new computed-column class is based on regular_column_transformation
introduced in the previous patch. It is not yet wired to anything:
The MV code cannot handle any regular_column_transformation yet, and
Alternator will not yet use it to create a GSI. We'll do those things
in the following patches.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch is to cmake what the previous patch was to configure.py.
In the next patch we want to make schema/schema.o depend on
alternator/executor.o - because when the schema has an Alternator
computed column, the schema code needs to construct the computed column
object (extract_from_attrs_column_computation) and that lives in
alternator/executor.o.
In the cmake-based build, all the schema/* objects are put into one
library "libschema.a". But code that uses this library (e.g., tests)
can't just use that library alone, because it depends on other code
not in schema/. So CMakeLists.txt lists other "libraries" that
libschema.a depends on - including for example "cql3". We now need
to add "alternator" to this dependency list. The dependency is marked
"PRIVATE" - schema needs alternator for its own internal uses, but
doesn't need to export alternator's APIs to its own users.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
For an unknown (to me) reason, configure.py has two separate source
file lists - "scylla_core" and "alternator". Scylla, and almost all tests,
are compiled with both lists, but just a couple of tests were compiled
with just scylla_core without alternator.
In the next patch we want to make schema/schema.o depened on
alternator/executor.o because when the schema has an Alternator
computed column, the schema code needs to construct the computed column
object (extract_from_attrs_column_computation) and that lives in
alternator/executor.o.
This change will break the build of the two tests that do not include
the Alternator objects. So let's just add the "alternator" dependencies
to the couple of tests that were missing it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch introduces a function serialized_value_if_type() which takes
a serialized value stored in the ":attrs" map, and converts it into a
serialized *CQL* type if it matches a particular type (S, B or N) - or
returns null the value has the wrong type.
We will use this function in the following patch for deserializing
values stored in the ":attrs" map to use them as a materialized view
key. If the value has the right type, it will be converted to the CQL
type and used as the key - but if it has the wrong type the key will
be null and it will not appear in the view. This is exactly how GSI
is supposed to behave.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the patches that follow, we want Alternator to be able to use as a
key for a materialized view (GSI) not a real column from the schema,
but rather an attribute value deserialized from a member of the ":attrs"
map.
For this, we need the ability for materialized view to define a key
column which is computed as function of a real column (":attrs").
We already have an MV feature which we called "computed column"
(column_computation), but it is wholy inadequate for this job:
column_computation can only take a partition key, and produce a value -
while we need it to take a regular column (one member of ":attrs"),
not just the partition key, and return a cell - value or deletion,
timestamp and TTL.
So in this patch we introduce a new type of computed column, which we
called "regular_column_transformation" since it intends to perform some
sort of transformation on a single column (or more accurately, a single
atomic cell). The limitation that this function transforms a single
column only is important - if we had a function of multiple columns,
we wouldn't know which timestamp or ttl it should use for the result
if the two columns had different timestamps or TTLs.
The new class isn't wired to anything yet: The MV code cannot handle
it yet, and the Alternator code will not use it yet.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds the missing IndexStatus and Backfilling fields for the
GSIs listed by a DescribeTable request. These fields allow an application
to check whether a GSI has been fully built (IndexStatus=ACTIVE) or
currently being built (IndexStatus=CREATING, Backfilling=true).
This feature is necessary when a GSI can be added to an existing table
so its backfilling might take time - and the application might want to
wait for it.
One test - test_gsi.py::test_gsi_describe_indexstatus - begins to pass
with this fix, so the xfail tag is removed from it.
Fixes#11471.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds to Alternator's api_error type yet another type of
error, api_error::limit_exceeded (error code "LimitExceededException").
DynamoDB returns this error code in certain situations where certain
low limits were exceeded, such as the case we'll need in a following
patch - an UpdateTable that tries to create more than one GSI at once.
The LimitExceededException error type should not be confused with
other similarly-named but different error messages like
ProvisionedThroughputExceededException or RequestLimitExceeded.
In general, we make an attempt to return the same error code that
DynamoDB returns for a given error.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Two new features were added to DynamoDB this month - MultiRegionConsistency
and WarmThroughput. Document them as unimplemented - and link to the
relevant issue in our bug tracker - in docs/alternator/compatibility.md.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this change, it was possible to change non-liveupdatable config
parameter without process restart. This erroneous behavior not only
contradicts the documentation but is potentially dangerous, as various
components theoretically might not be prepared for a change of
configuration parameter value without a restart. The issue came from
a fact that liveupdatability verification check was skipped for default
configuration parameters (those without its initial values
in configuration file during process start).
This change:
- Introduce _initialization_completed member in config_file
- Set _initialization_completed=true when config file is processed on
server start
- Verify config_file's initialization status during config update - if
config_file was initialized, prevent from further changes of
non-liveupdatable parameters
- Implement ScyllaRESTAPIClient::get_config() that obtains a current
value of given configuration parameter via /v2/config REST API
- Implement test to confirm that only liveupdatable parameters are
changed when SIGHUP is sent after configuration file change
Function set_initialization_completed() is called only once in main.cc,
and the effect is expected to be visible in all shards, as a side effect
of cfg->broadcast_to_all_shards() that is called shortly after. The same
technique was already used for enable_3_1_0_compatibility_mode() call.
Fixesscylladb/scylladb#5382
No backport - minor fix.
Closesscylladb/scylladb#22655
* github.com:scylladb/scylladb:
test: SIGHUP doesn't change non-liveupdatable configuration
test: implement ScyllaRESTAPIClient::get_config()
config: prevent SIGHUP from changing non-liveupdatable parameters
config: remove unused set_value_on_all_shards(const YAML::Node&)
This update introduces four types of credential providers:
1. Environment variables
2. Configuration file
3. AWS STS
4. EC2 Metadata service
The first two providers should only be used for testing and local runs. **They must NEVER be used in production.**
The last two providers are intended for use on real EC2 instances:
- **AWS STS**: Preferred method for obtaining temporary credentials using IAM roles.
- **EC2 Metadata Service**: Should be used as a last resort.
Additionally, a simple credentials provider chain is created. It queries each provider sequentially until valid credentials are obtained. If all providers fail, it returns an empty result.
fixes: #21828Closesscylladb/scylladb#21830
* github.com:scylladb/scylladb:
docs: update the `object_storage.md` and `admin.rst`
aws creds: add STS and Instance Metadata service credentials providers
aws creds: add env. and file credentials providers
s3 creds: move credentials out of endpoint config
Rather than target_max_tablet_size. We need both the target
as well as max and min tablet sizes, so there is no sense
in keeping the max and deriving the target and the minimum
for the max value.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use the keyspace initial_tablets for min_tablet_count, if the latter
isn't set, then take the maximum of the option-based tablet counts:
- min_tablet_count
- and expected_data_size_in_gb / target_tablet_size
- min_per_shard_tablet_count (via
calculate_initial_tablets_from_topology)
If none of the hints produce a positive tablet_count,
fall back to calculate_initial_tablets_from_topology * initial_scale.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Current implementation is inefficient as it calls
get_datacenter_token_owners_ips and then find_node(ep)
while for_each_node easily provides a host_id for
is_normal_token_owner.
Then, since we're interested only in datacenters
configure with a replication factor (but it still might be 0),
simply iterate over the dc->rf map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, if a datacenter has no replication_factor option
we consider its replication factor to be 1 in
calculate_initial_tablets_from_topology, but since
we're not going to have any replica on it, it should
be 0.
This is very minor since in the worst case, it
will pessimize the calculation and calculate a value
for initial_tablets that's higher than it could be.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Keyspace `initial` tablets option is deprecated
and may be removed in the future.
Rather than relying on `initial`:0 to always enabled
tablets, explicitly print "enabled":true when tablets
are enabled and initial_tablets=0, same as keyspace_metadata::describe.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Per-table hints should be used instead.
Note: the warning is produced by check_against_restricted_replication_strategies
which is called also from alter_keyspace_statement.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test specifying of per-table tablet options on table creation
and alter table.
Also, add a negative test for atempting to use tablet options
with vnodes (that should fail).
And add a basic test for testing tablet options also with
materialized views.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Unlike with vnodes, each tablet is served only by a single
shard, and it is associated with a memtable that, when
flushed, it creates sstables which token-range is confined
to the tablet owning them.
On one hand, this allows for far better agility and elasticity
since migration of tablets between nodes or shards does not
require rewriting most if not all of the sstables, as required
with vnodes (at the cleanup phase).
Having too few tablets might limit performance due not
being served by all shards or by imbalance between shards
caused by quantization. The number of tabelts per table has to be
a power of 2 with the current design, and when divided by the
number of shards, some shards will serve N tablets, while others
may serve N+1, and when N is small N+1/N may be significantly
larger than 1. For example, with N=1, some shards will serve
2 tablet replicas and some will serve only 1, causing an imbalance
of 100%.
Now, simply allocating a lot more tablets for each table may
theoretically address this problem, but practically:
a. Each tablet has memory overhead and having too many tablets
in the system with many tables and many tablets for each of them
may overwhelm the system's and cause out-of-memory errors.
b. Too-small tablets cause a proliferation of small sstables
that are less efficient to acces, have higher metadata overhead
(due to per-sstable overhead), and might exhaust the system's
open file-descriptors limitations.
The options introduced in this change can help the user tune
the system in two ways:
1. Sizing the table to prevent unnecessary tablet splits
and migrations. This can be done when the table is created,
or later on, using ALTER TABLE.
2. Controlling min_per_shard_tablet_count to improve
tablet balancing, for hot tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For example, nodes which are being decommissioned should not be
consider as available capacity for new tables. We don't allocate
tablets on such nodes.
Would result in higher per-shard load then planned.
Closesscylladb/scylladb#22657
in order to reduce the external header dependency, let's switch to
the standardlized std::ranges::min_element().
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22572
This config item controls how many CPU-bound reads are allowed to run in
parallel. The effective concurrency of a single CPU core is 1, so
allowing more than one CPU-bound reads to run concurrently will just
result in time-sharing and both reads having higher latency.
However, restricting concurrency to 1 means that a CPU bound read that
takes a lot of time to complete can block other quick reads while it is
running. Increase this default setting to 2 as a compromise between not
over-using time-sharing, while not allowing such slow reads to block the
queue behind them.
Fixes: #22450Closesscylladb/scylladb#22679
One of the design goals of the Alternator test suite (test/alternator)
is that developers should be able to run the tests against some already
running installation by running `cd test/alternator; pytest [--url ...]`.
Some of our presentations and documents recommend running Alternator
via docker as:
docker run --name scylla -d -p 8000:8000 scylladb/scylla:latest
--alternator-port=8000 --alternator-write-isolation=always
This only makes port 8000 available to the host - the CQL port is
blocked. We had a bug in conftest.py's get_valid_alternator_role()
which caused it to fail (and fail every single test) when CQL is
not available. What we really want is that when CQL is not available
and we can't figure out a correct secret key to connect to Alternator,
we just try a connect with a fake key - and hope that the option
alternator-enforce-authorization is turned off. In fact, this is what
the code comments claim was already happening - but we failed to
handle the case that CQL is not available at all.
After this patch, one can run Alternator with the above docker
command, and then run tests against it.
By the way, this provides another way for running any old release of
Scylla and running Alternator tests against it. We already supported
a similar feature via test/alternator/run's "--release" option, but
its implementation doesn't use docker.
Fixes#22591
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22592
On short-pages, cut short because of a tombstone prefix.
When page-results are filtered and the filter drops some rows, the
last-position is taken from the page visitor, which does the filtering.
This means that last partition and row position will be that of the last
row the filter saw. This will not match the last position of the
replica, when the replica cut the page due to tombstones.
When fetching the next page, this means that all the tombstone suffix of
the last page, will be re-fetched. Worse still: the last position of the
next page will not match that of the saved reader left on the replica, so
the saved reader will be dropped and a new one created from scratch.
This wasted work will show up as elevated tail latencies.
Fix by always taking the last position from raw query results.
Fixes: #22620Closesscylladb/scylladb#22622
The `which` command is typically not installed on cloud OS images
and so requires the user to remember to install it (or to be prompted
by a failure to install it).
Replace it with the built-in `type` that is always there. Wrap it
in a function to make it clear what it does.
Closesscylladb/scylladb#22594
Since mid December, tests started failing with ENOMEM while
submitting I/O requests.
Logs of failed tests show IO uring was used as backend, but we
never deliberately switched to IO uring. Investigation pointed
to it happening accidentaly in commit 1bac6b75dc,
which turned on IO uring for allowing native tool in production,
and picked linux-aio backend explicitly when initializing Scylla.
But it missed that seastar-based tests would pick the default
backend, which is io_uring once enabled.
There's a reason we never made io_uring the default, which is
that it's not stable enough, and turns out we made the right
choice back then and it apparently continue to be unstable
causing flakiness in the tests.
Let's undo that accidental change in tests by explicitly
picking the linux-aio backend for seastar-based tests.
This should hopefully bring back stability.
Refs #21968.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#22695
This commit introduces two new credentials providers: STS and Instance Metadata Service. The S3 client's provider chain has been updated to incorporate these new providers. Additionally, unit tests have been added to ensure coverage of the new functionality.
This commit entirely removes credentials from the endpoint configuration. It also eliminates all instances of manually retrieving environment credentials. Instead, the construction of file and environment credentials has been moved to their respective providers. Additionally, a new aws_credentials_provider_chain class has been introduced to support chaining of multiple credential providers.
Keep host_id of a node in task manager. If host_id wasn't resolved
yet, task manager will keep an empty id.
It's a preparation for the following changes.
Before this change, it was possible to change non-liveupdatable config
parameter without process restart. This erroneous behavior not only
contradicts the documentation but is potentially dangerous, as various
components theoretically might not be prepared for a change of
configuration parameter value without a restart. The issue came from
a fact that liveupdatability verification check was skipped for default
configuration parameters (those without its initial values
in configuration file during process start).
This change:
- Introduce _initialization_completed member in config_file
- Set _initialization_completed=true when config file is processed on
server start
- Verify config_file's initialization status during config update - if
config_file was initialized, prevent from further changes of
non-liveupdatable parameters
Fixesscylladb/scylladb#5382
Clean the code validating if a replication strategy can be used.
This PR consists of a bunch of unmerged https://github.com/scylladb/scylladb/pull/20088 commits - the solution to the problem that the linked PR tried to solve has been accomplished in another PR, leaving the refactor commits unmerged. The commits introduced in this PR have already been reviewed in the old PR.
No need to backport, it's just a refactor.
Closesscylladb/scylladb#22516
* github.com:scylladb/scylladb:
cql: restore validating replication strategies options
cql: change validating NetworkTopologyStrategy tags to internal_error
cql: inline abstract_replication_strategy::validate_replication_strategy
cql: clean redundant code validating replication strategy options
Currently, the session ID under which the truncate for tablets request is
running is created during the request creation and queuing. This is a problem
because this could overwrite the session ID of any ongoing operation on
system.topology#session
This change moves the creation of the session ID for truncate from the request
creation to the request handling.
Fixes#22613Closesscylladb/scylladb#22615
with_permit() creates a permit, with a self-reference, to avoid
attaching a continuation to the permit's run function. This
self-reference is used to keep the permit alive, until the execution
loop processes it. This self reference has to be carefully cleared on
error-paths, otherwise the permit will become a zombie, effectively
leaking memory.
Instead of trying to handle all loose ends, get rid of this
self-reference altogether: ask caller to provide a place to save the
permit, where it will survive until the end of the call. This makes the
call-site a little bit less nice, but it gets rid of a whole class of
possible bugs.
Fixes: #22588Closesscylladb/scylladb#22624
This commit refactors the way AWS credentials are managed in Scylla. Previously, credentials were included in the endpoint configuration. However, since credentials and endpoint configurations serve different purposes and may have different lifetimes, it’s more logical to manage them separately. Moving forward, credentials will be completely removed from the endpoint_config to ensure clear separation of concerns.
This change:
- Remove unused set_value_on_all_shards(const YAML::Node&) member
function in class config_file::named_value
The function logic was flawed, in a similar way
named_value<T>::set_value(const YAML::Node& node) is flawed: the config
source verification is insufficient for liveupdatable parameters,
allowing overwriting of non-liveupdatable config parameters (refer to
scylladb#5382). As the function was not used, it was removed instead of
fixing.
As of right now, materialized views (and consequently secondary
indexes), lwt and counters are unsupported or experimental with tablets.
Since by defaults tablets are enabled, training cases using those
features are currently broken.
The right thing to do here is to disable tablets in those cases.
Fixes https://github.com/scylladb/scylladb/issues/22638Closesscylladb/scylladb#22661
`validate_options` needs to be extended with
`topology` parameter, because NetworkTopologyStrategy needs to validate if every
explicitly listed DC is really existing. I did cut corner a bit and
trimmed the message thrown when it's not the case, just to avoid passing
and extra parameter (ks name) to the `validate_options`
function, as I find the longer message to be a bit redundant (the driver will
receive info which KS modification failed).
The tests that have been commented out in the previous commit have been
restored.
The check for `replication_factor` tag in
`network_topology_strategy::validate_options` is redundant for 2 reasons:
- before we reach this part of the code, the `replication_factor` tag
is replaced with specific DC names
- we actually do allow for `replication_factor` tag in
NetworkTopologyStrategy for keyspaces that have tablets disabled.
This code is unreachable, hence changing it to an internal error, which
means this situation should never occur.
The place that unrolls `replication_factor` tag checked for presence of
this tag ignoring the case, which lead to an unexpected behaviour:
- `replication_factor` tag (note the lowercase) was unrolled, as
explained above,
- the same tag but written in any other case resulted in throwing a vague
message: "replication_factor is an option for SimpleStrategy, not
NetworkTopologyStrategy".
So we're changing this validation to accept and unroll only the
lowercase version of this tag. We can't ignore the case here, as this
tag is present inside a json, and json is case-sensitive, even though the
CQL itself is case insensitive.
Added a test that passes for both scylla and cassandra.
Fixes: #15336
task_stats contains short info about a task. To get a list of task_stats
in the module, one needs to request /task_manager/list_module_tasks/{module}.
To make identification and navigation between tasks easier, extend
task_stats to contain shard, start_time, and end_time.
Closesscylladb/scylladb#22351
tablet_repair_task_impl is run as a part of tablet repair. Make it
a child of tablet repair virtual task.
tablet_repair_task_impl started by /storage_service/repair_async API
(vnode repair) does not have a parent, as it is the top-level task
in that case.
No backport needed; new functionality
Closesscylladb/scylladb#22372
* github.com:scylladb/scylladb:
test: add test to check tablet repair child
service: add child for tablet repair virtual task
Currently, when the tablet repair is started, info regarding
the operation is kept in the system.tablets. The new tablet states
are reflected in memory after load_topology_state is called.
Before that, the data in the table and the memory aren't consistent.
To check the supported operations, tablet_virtual_task uses in-memory
tablet_metadata. Hence, it may not see the operation, even though
its info is already kept in system.tablets table.
Run read barrier in tablet_virtual_task::contains to ensure it will
see the latest data. Add a test to check it.
Fixes: #21975.
Closesscylladb/scylladb#21995
This was originally an attempt to reduce the compile time of this
translation unit, but apparently it doesn't work. Still, it has
the effect of converting stack traces that say "set_storage_service"
and refer to some lambda to stack traces that refer to the operation
being performed, so it's a net positive.
To faciliate the change, we introduce new functions rest_bind(),
similar to (and in fact wrapping) std::bind_front(), that capture
references like the lambdas did originally. We can't use
std::bind_front directly since the call to
seastar::httpd::path_description::set() cannot be disambiguated
after the function is obscured by the template returned by
std::bind_front. The new function rest_bind() has constraints
to understand which overload is in use.
Closesscylladb/scylladb#22526
This PR enhances the internode_compression configuration option in two ways:
1. Add validation for option values
Previously, we silently defaulted to 'none' when given invalid values. Now we
explicitly validate against the three supported values (all, dc, none) and
reject invalid inputs. This provides better error messages when users
misconfigure the option.
2. Fix documentation rendering
The help text for this option previously used C++ escape sequences which
rendered incorrectly in Sphinx-generated HTML. We now use bullet points with
'*' prefix to list the available values, matching our documentation style
for other config options. This ensures consistent rendering in both CLI
and HTML outputs.
Note: The current documentation format puts type/default/liveness information
in the same bullet list as option values. This affects other config options
as well and will need to be addressed in a separate change.
---
this improves the handling of invalid option values, and improves the doc rendering, neither of which is critical. hence no need to backport.
Closesscylladb/scylladb#22548
* github.com:scylladb/scylladb:
config: validate internode_compression option values
config: start available options with '*'
Bug https://bugs.python.org/issue26789 is resolved in python 3.10.
The frozen tool chain uses python 3.12. Since this is a supported and
recommended way for work environment, removing workaround and bumping
requirements for a newer python version.
Closesscylladb/scylladb#22627
Following the work done in ed4bfad5c3, the action is failing with the
following error:
```
Error: Input required and not supplied: token
```
It is due ot missing permissions in the workflow, adding it
Closesscylladb/scylladb#22630
During topology state application peers table may be updated with the
new ip->id mapping. The update is not atomic: it adds new mapping and
then removes the old one. If we call get_host_id_to_ip_map while this is
happening it may trigger an internal error there. This is a regression
since ef929c5def. Before that commit the
code read the peers table only once before starting the update loop.
This patch restores the behaviour.
Fixes: scylladb/scylladb#22578
tablet_repair_task_impl is run as a part of tablet repair. Make it
a child of tablet repair virtual task.
tablet_repair_task_impl started by /storage_service/repair_async API
(vnode repair) does not have a parent, as it is the top-level task
in that case.
* seastar 71036ebcc0...5b95d1d798 (3):
> rpc stream: do not abort stream queue if stream connection was closed without error
> resource: fallback to sysconf when failed to detect memory size from hwloc
> Merge 'scheduling_group: improve scheduling group creation exception safety' from Michael Litvak
scylla-gdb.py adjusted for scheduling_group_specific data structure
changes in Seastar. As part of that, a gratuitous dereference of
std::unique_ptr, which fails for std::unique_ptr<void*, ...>, was
removed.
The test expects and asserts that after wait_for_view is completed we
read the view_build_status table and get a row for each node and view.
But this is wrong because wait_for_view may have read the table on one
node, and then we query the table on a different node that didn't insert
all the rows yet, so the assert could fail.
To fix it we change the test to retry and check that eventually all
expected rows are found and then eventually removed on the same host.
Fixesscylladb/scylladb#22547Closesscylladb/scylladb#22585
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.
A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.
The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.
To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.
Fixesscylladb/scylladb#21829Closesscylladb/scylladb#22493
Add two cqlpy tests that reproduce a bug where a secondary index query
returns more rows than the specified limit. This occurs when the indexed
column is a partition key column or the first clustering key column,
the query result spans multiple partitions, and the last partition
causes the limit to be exceeded.
`test/cqlpy/run --release ...` shows that the tests fail for Scylla
versions all the way back to 4.4.0. Older Scylla versions fail with a
syntax error in CQL query which suggests some incompatibility in the
CQL protocol. That said, this bug is not a regression.
The tests pass in Cassandra 5.0.2.
Refs #22158.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#22513
std::any_of was included by C++11, and boost::algorithm::any_of() is
provided by Boost for users stuck in the pre-C++11 era. in our case,
we've moved into C++23, where the ranges variant of this algorithm
is available.
in order to reduce the header dependency, let's switch to
`std::ranges::any_of()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22503
Materialized views with tablets are not stable yet, but we want
them available as an experimental feature, mainly for teseting.
The feature was added in scylladb/scylladb#21833,
but currently it has no effect. All tests have been updated to use the
feature, so we should finally make it work.
This patch prevents users from creating materialized views in keyspaces
using tablets when the VIEWS_WITH_TABLETS feature is not enabled - such
requests will now get rejected.
Fixesscylladb/scylladb#21832Closesscylladb/scylladb#22217
This commit addresses issue #21825, where invalid PERCENTILE values for
the `speculative_retry` setting were not properly handled, causing potential
server crashes. The valid range for PERCENTILE is between 0 and 100, as defined
in the documentation for speculative retry options, where values above 100 or
below 0 are invalid and should be rejected.
The added validation ensures that such invalid values are rejected with a clear
error message, improving system stability and user experience.
Fixes#21825Closesscylladb/scylladb#21879
Moving a PR out of draft is only allowed to users with write access,
adding a github action to switch PR to `ready for review` once the
`conflicts` label was removed
Closesscylladb/scylladb#22446
This patch adds an Alternator test for the case of UpdateItem attempting
to insert in invalid B (bytes) value into an item. Values of type B
use base64 encoding, and an attempt to insert a value which isn't
valid base64 should be rejected, and this is what this test verifies.
The new tests reproduce issue #17539, which claimed we have a bug in
this area. However, test/alternator/run with the "--release" option
shows that this bug existed in Scylla 5.2, but but fixed long ago, in
5.3 and doesn't exist in master. But we never had a regression test this
issue, so now we do.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22029
Enabled with the tablets_rack_aware_view_pairing cluster feature
rack-aware pairing pairs base to view replicas that are in the
same dc and rack, using their ordinality in the replica map
We distinguish between 2 cases:
- Simple rack-aware pairing: when the replication factor in the dc
is a multiple of the number of racks and the minimum number of nodes
per rack in the dc is greater than or equal to rf / nr_racks.
In this case (that includes the single rack case), all racks would
have the same number of replicas, so we first filter all replicas
by dc and rack, retaining their ordinality in the process, and
finally, we pair between the base replicas and view replicas,
that are in the same rack, using their original order in the
tablet-map replica set.
For example, nr_racks=2, rf=4:
base_replicas = { N00, N01, N10, N11 }
view_replicas = { N11, N12, N01, N02 }
pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 }
Note that we don't optimize for self-pairing if it breaks pairing ordinality.
- Complex rack-aware pairing: when the replication factor is not
a multiple of nr_racks. In this case, we attempt best-match
pairing in all racks, using the minimum number of base or view replicas
in each rack (given their global ordinality), while pairing all the other
replicas, across racks, sorted by their ordinality.
For example, nr_racks=4, rf=3:
base_replicas = { N00, N10, N20 }
view_replicas = { N11, N21, N31 }
pairing would be: { N00, N31 }\*, { N10, N11 }, { N20, N21 }
\* cross-rack pair
If we'd simply stable-sort both base and view replicas by rack,
we might end up with much worse pairing across racks:
{ N00, N11 }\*, { N10, N21 }\*, { N20, N31 }\*
\* cross-rack pair
Fixesscylladb/scylladb#17147
* This is an improvement so no backport is required
Closesscylladb/scylladb#21453
* github.com:scylladb/scylladb:
network_topology_strategy_test: add tablets rack_aware_view_pairing tests
view: get_view_natural_endpoint: implement rack-aware pairing for tablets
view: get_view_natural_endpoint: handle case when there are too few view replicas
view: get_view_natural_endpoint: track replica locator::nodes
locator: topology: consult local_dc_rack if node not found by host_id
locator: node: add dc and rack getters
feature_service: add tablet_rack_aware_view_pairing feature
view: get_view_natural_endpoint: refactor predicate function
view: get_view_natural_endpoint: clarify documentation
view: mutate_MV: optimize remote_endpoints filtering check
view: mutate_MV: lookup base and view erms synchronously
view: mutate_MV: calculate keyspace-dependent flags once
When a replica get a write request it performs get_schema_for_write,
which waits until the schema is synced. However, database::add_column_family
marks a schema as synced before the table is added. Hence, the write may
see the schema as synced, but hit no_such_column_family as the table
hasn't been added yet.
Mark schema as synced after the table is added to database::_tables_metadata.
Fixes: #22347.
Closesscylladb/scylladb#22348
If start_time/end_time is unspecified for a task, task_manager API
returns epoch. Nodetool prints the value in task status.
Fix nodetool tasks commands to print empty string for start_time/end_time
if it isn't specified.
Modify nodetool tasks status docs to show empty end_time.
Fixes: #22373.
Closesscylladb/scylladb#22370
Fixes#22401
In the fix for scylladb/scylla-enterprise#892, the extraction and check for sstable component encryption mask was copied
to a subroutine for description purposes, but a very important 1 << <value> shift was somehow
left on the floor.
Without this, the check for whether we actually contain a component encrypted can be wholly
broken for some components.
Closesscylladb/scylladb#22398
This change:
- Remove code that prevented audit from starting if audit_categories,
audit_tables, and audit_keyspaces are not configured
- Set liveness::LiveUpdate for audit_categories, audit_tables,
and audit_keyspaces
- Keep const reference to db::config in audit, so current config values
can be obtained by audit implementation
- Implement function audit::update_config to parse given string, update
audit datastructures when needed, and log the changes.
- Add observers to call audit::update_config when categories,
tables, or keyspaces configuration changes
New functionality, so no backport needed.
Fixes https://github.com/scylladb/scylla-enterprise/issues/1789Closesscylladb/scylladb#22449
* github.com:scylladb/scylladb:
audit: make categories, tables, and keyspaces liveupdatable
audit: move static parsing functions above audit constructors
audit: move statement_category to string conversion to static function
audit: start audit even with empty categories/tables/keyspaces
Currently, when the status of a task is queried and the task is already finished,
it gets unregistered. Getting the status shouldn't be a one-time operation.
Stop removing the task after its status is queried. Adjust tests not to rely
on this behavior. Add task_manager/drain API and nodetool tasks drain
command to remove finished tasks in the module.
Fixes: https://github.com/scylladb/scylladb/issues/21388.
It's a fix to task_manager API, should be backported to all branches
Closesscylladb/scylladb#22310
* github.com:scylladb/scylladb:
api: task_manager: do not unregister tasks on get_status
api: task_manager: add /task_manager/drain
Repair service is started after storage service, while storage service needs to reference repair one for its needs. Recently it was noticed, that this reverse order may cause troubles and was fixed with the help of an extra gate. That's not nice and makes the start-stop mess even worse. The correct fix is to fix the order both services start/stop in.
Closesscylladb/scylladb#22368
* github.com:scylladb/scylladb:
Revert "repair: add repair_service gate"
main: Start repair before storage service
repair: Check for sharded<view-builder> when constructing row_level_repair
This patch adds extensive functional tests for the DynamoDB multi-item
transactions feature - the TransactWriteItems and TransactGetItems
requests. We add 43 test functions, spanning more than 1000 lines of code,
covering the different parameters and corner cases of these requests.
Because we don't support the transaction feature in Alternator yet (this
is issue #5064), all of these tests fail on Alternator but all of them
were tested to pass on DynamoDB. So all new tests are marked "xfail".
These tests will be handy for whoever will implement this feature as
an acceptance test, and can also be useful for whoever will just want to
understand this feature better - the tests are short and simple and
heavily commented.
Note that these tests only check the correct functionality of individual
calls of these requests - these tests cannot and do not check the
consistency or isolation guarantees of concurrent invocations of
several requests. Such tests would require a different test framework,
such as the one requested in issue #6350, and are therefore not part of
this patch.
Note that this patch includes ONLY tests, and does not mean that an
implementation of the feature will soon follow. In fact, nobody is
currently working on implementing this feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22239
Introduce `defer_verbose_shutdown` in `cql_test_env` which logs
a message before and after shutting down a service, distinguishing
between success and failure.
The function is similar to the one in `main` but skips special error
handling logic applicable only to the main Scylla binary. The purpose
of the `cql_test_env` version of this function is only more verbose
logging. If necessary it can be extended in the future with additional
logic.
I estimated the impact on the size of produced log files using
`cdc_test` as an example:
```
$ build/dev/test/boost/combined_tests --run_test=cdc_test -- --smp=2 \
>logfile 2>&1
$ du -b logfile
```
the result before this commit: 1964064 bytes, after: 2196432 bytes,
so estimated ~12% increase of log file size for boost tests that use
`cql_test_env`, assuming that the number of logs printed by each test is
similar to the logs printed by `cdc_test` (but I believe `cdc_test` is
one of the less verbose tests so this is an overestimate).
The motivation for this change is easier debugging of shutdown issues.
When investigating scylladb/scylladb#21983, where an exception is
thrown somewhere during the shutdown procedure, I found it hard to
pinpoint the service from which the exception originates. This change
will make it easier to debug issues like that by wrapping shutdown of
each service in a pair of messages logged when shutdown starts and when
it finishes (including when it fails). We should get more details on
this issue when it reproduces again in CI after this commit is merged
into `master`. (I failed to reproduce it locally with 1000 runs.)
Ref scylladb/scylladb#21983Closesscylladb/scylladb#22566
Fixes#22236
If reading a file and not stopping on block bounds returned by `size()`, we could allow reading from (_file_size+<1-15>) (if crossing block boundary) and try to decrypt this buffer (last one).
Simplest example:
Actual data size: 4095
Physical file size: 4095 + key block size (typically 16)
Read from 4096: -> 15 bytes (padding) -> transform return `_file_size` - `read offset` -> wraparound -> rather larger number than we expected (not to mention the data in question is junk/zero).
Check on last block in `transform` would wrap around size due to us being >= file size (l).
Just do an early bounds check and return zero if we're past the actual data limit.
Closesscylladb/scylladb#22395
* github.com:scylladb/scylladb:
encrypted_file_test: Test reads beyond decrypted file length
encrypted_file_impl: Check for reads on or past actual file length in transform
This commit adds the information about SStable version support in 2025.1
by replacing "2022.2" with "2022.2 and above".
In addition, this commit removes information about versions that are
no longer supported.
Fixes https://github.com/scylladb/scylladb/issues/22485Closesscylladb/scylladb#22486
test_coro_frame is flaky, as if
`service::topology_coordinator::run() [clone .resume]` wasn't running
on the shard. But it's supposed to.
Perhaps this is a bug in `find_vptrs()`?
This patch asks `scylla find` for a second opinion, and also prints
all `find_vptrs()`, to see if it's the only coroutine missing
from there.
Closesscylladb/scylladb#22534
since C++20, std::string and std::string_view started providing
`ends_with()` member function, the same applies to `seastar::sstring`,
so there is no need to use `boost::ends_with()` anymore.
in this change, we switch from `boost::ends_with()` to the member
functions variant to
- improve the readability
- reduce the header dependency
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22502
* seastar 18221366...71036ebc (2):
> Merge 'fair_queue: make the fair_group token grabbing discipline more fair' from Michał Chojnowski
apps/io_tester: add some test cases for the IO scheduler
test: in fair_queue_test, ensure that tokens are only replenished by test_env
fair_queue: track the total capacity of queued requests
fair_queue: make the fair_group token grabbing discipline more fair
> scheduling: auto-detect scheduling group key rename() method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#22504
Previously, the internode_compression option silently defaulted to 'none'
for any unrecognized value instead of validating input. It only compared
against 'all' and 'dc', making it error-prone.
Add explicit validation for the three supported values:
- all
- dc
- none
This ensures invalid values are rejected both in command line and YAML
configuration, providing better error messages to users.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
use '*' prefix for config option values instead of escape sequences
The custom Sphinx extension that generates documentation from config.cc
help messages has issues with C++ escape sequences. For example,
"\tall: All traffic" renders incorrectly as "tall: All traffic" in HTML
output.
Instead of using escape sequences, switch to bullet-point style with '*'
prefix which works better in both CLI and HTML rendering. This matches
our existing documentation style for available option values in other
configs.
Note: This change puts type/default/liveness info in the same bullet
list as option values. This limitation affects other similar config
options and will need to be addressed comprehensively in a future
change.
Refs scylladb/scylladb#22423
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add missing vector type documentation including: definition of vector,
adjustment of term definition, JSON encoding, Lua and cql3 type
mapping, vector dimension limit, and keyword specification.
Add cql_vector_test which tests the basic functionalities of
the vector type using CQL.
Add vectors_test which tests if descending ordering of vector
is supported.
This change has been introduced to enable CQL drivers to recognize
vector type in query results.
The encoding has been imported from Apache Cassandra implementation
to match Cassandra's and latest drivers' behaviour.
Co-authored-by: Dawid Pawlik <501149991dp@gmail.com>
Like mentioned in the previous commit, this changes introduce usage
of vector style type and adjusts the functions using list style type
to distinguish vectors from lists.
Rename collection constructor style list to list_or_vector.
Motivation for this changes is to provide a distinguishable interface
for vector type expressions.
The square bracket literal is ambigious for lists and vectors,
so that we need to perform a distinction not using CQL layer.
At first we should use the collection constructor to manage
both lists and vectors (although a vector is not a collection).
Later during preparation of expressions we should be able to get
to know the exact type using given receiver (column specification).
Knowing the type of expression we may use their respective style type
(in this case the vector style type being introduced),
which would make the implementation more precise and allow us to
evaluate the expressions properly.
This commit introduces vector style type and functions making use of it.
However vector style type is not yet used anywhere,
the next commit should adjust collection constructor and make use
of the new vector style type and it's features.
These tests check serialization and deserialization (including JSON),
basic inserts and selects, aggregate functions, element validation,
vector usage in user defined types and functions.
test_vector_between_user_types is a translated Apache Cassandra test
to check if it is handled properly internally.
This change is introduced due to lack of support for vector class name,
used by type_parser to create data_type based on given class name
(especially compound class name with inner types or other parameters).
Add function that parses vector type parameters from a class name.
Introduce vector_type CQL syntax: VECTOR<`cql_type`, `integer`>.
The parameters are respectively a type of elements of the vector
and the vector's dimension (number of elements).
Co-authored-by: Jan Łakomy <janpiotrlakomy@gmail.com>
During raft upgrade, a node may gossip about a new CDC generation that
was propagated through raft. The node that receives the generation by
gossip may have not applied the raft update yet, and it will not find
the generation in the system tables. We should consider this error
non-fatal and retry to read until it succeeds or becomes obsolete.
Another issue is when we fail with a "fatal" exception and not retrying
to read, the cdc metadata is left in an inconsistent state that causes
further attempts to insert this CDC generation to fail.
What happens is we complete preparing the new generation by calling `prepare`,
we insert an empty entry for the generation's timestamp, and then we fail. The
next time we try to insert the generation, we skip inserting it because we see
that it already has an entry in the metadata and we determine that
there's nothing to do. But this is wrong, because the entry is empty,
and we should continue to insert the generation.
To fix it, we change `prepare` to return `true` when the entry already
exists but it's empty, indicating we should continue to insert the
generation.
Fixesscylladb/scylladb#21227Closesscylladb/scylladb#22093
Since now topology does not contain ip addresses there is no need to
create topology on an ip address change. Only peers table has to be
updated. The series factors out peers table update code from
sync_raft_topology_nodes() and calls it on topology and ip address
updates. As a side effect it fixes#22293 since now topology loading
does not require IP do be present, so the assert that is triggered in
this bug is removed.
Fixes: scylladb/scylladb#22293Closesscylladb/scylladb#22519
* github.com:scylladb/scylladb:
topology coordinator: do not update topology on address change
topology coordinator: split out the peer table update functionality from raft state application
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.
Skip the keyspace repair if no_such_keyspace is thrown during preparations.
Fixes: #22073.
Requires backport to 6.1 and 6.2 as they contain the bug
Closesscylladb/scylladb#22473
* github.com:scylladb/scylladb:
test: add test to check if repair handles no_such_keyspace
repair: handle keyspace dropped
Previously, during backup, SSTable components are preserved in the
snapshot directory even after being uploaded. This leads to redundant
uploads in case of failed backups or restarts, wasting time and
resources (S3 API calls).
This change removes SSTable components from the snapshot directory once
they are successfully uploaded to the target location. This prevents
re-uploading the same files and reduces disk usage.
This change only "Refs" https://github.com/scylladb/scylladb/issues/20655, because, we can further optimize
the backup process, consider:
- Sending HEAD requests to S3 to check for existing files before uploading.
- Implementing support for resuming partially uploaded files.
Fixes https://github.com/scylladb/scylladb/issues/21799
Refs https://github.com/scylladb/scylladb/issues/20655
---
the backup API is not used in production yet, so no need to backport.
Closesscylladb/scylladb#22285
* github.com:scylladb/scylladb:
backup_task: remove a component once it is uploaded
backup_task: extract component upload logic into dedicated function
snapshot-ctl: change snapshot_ctl::run_snapshot_modify_operation() to regular func
ExternalProject automatically creates BINARY_DIR for Seastar, but generator
expressions are not supported in this setting. This caused CMake to create
an unused "build/$<CONFIG>/seastar" directory.
Instead, define a dedicated variable matching configure.py's naming and use
it in supported options like BUILD_COMMAND. This:
- Creates build files in the standard "Seastar-prefix/src/Seastar-build"
directory instead of "build/$<CONFIG>/seastar". see
https://cmake.org/cmake/help/latest/module/ExternalProject.html#directory-options
- Makes it clearer that the variable should match configure.py settings
No functional changes to the Seastar build process - purely a cleanup to
reduce confusion when inspecting the build directory.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22437
these unused includes were identified by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
in which, instead of using `seastarx.hh`, `readers/mutation_reader.hh`,
use `using seastar::future` to include `future` in the global namespace,
this makes `readers/mutation_reader.hh` a header exposing `future<>`,
but this is not a good practice, because, unlike `seastarx.hh` or
`seastar/core/future.hh`, `reader/mutation_reader.hh` is not
responsible for exposing seastar declarations. so, we trade the
using statement for `#include "seastarx.hh"` in that file to decouple
the source files including it from this header because of this statement.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22439
Refs https://github.com/scylladb/seastar/issues/2513
Reloadable certificates use inotify instances. On a loaded test (CI) server, we've seen cases where we literally run out of capacity. This patch uses the extended callback and reload capability of seastar TLS to only create actual reloadable certificate objects on shard 0 for our main TLS points (encryption only does TLS on shard 0 already).
Closesscylladb/scylladb#22425
* github.com:scylladb/scylladb:
alternator: Make server peering sharded and reuse reloadable certs
messaging_service: Share reloadability of certificates across shards
redis/controller: Reuse shard 0 reloadable certificates for all shards
controller: Reuse shard 0 reloadable certificates for all shards
generic_server: Allow sharing reloadability of certificates across shards
Truncate table for tablets is implemented as a global topology operation. However, it does not have a transition state associated with it, and performs the truncate logic in `topology_coordinator::handle_global_request()` while `topology::tstate` remains empty. This creates problems because `topology::is_busy()` uses transition_state to determine if the topology state machine is busy, and will return false even though a truncate operation is ongoing.
This change introduces a new topology transition `topology::transition_state::truncate_table` and moves the truncate logic to a new method `topology_coordinator::handle_truncate_table()`. This method is now called as a handler of the `truncate_table` transition state instead of a handler of the `trunacate_table` global topology request.
This PR is a bugfix for truncate with tables and needs to be backported to 2025.1
Closesscylladb/scylladb#22452
* github.com:scylladb/scylladb:
truncate: trigger truncate logic from transition state instead of global request handler
truncate: add truncate_table transition state
The flush api could not detect if the node is down and fail the flush
before the timeout. This patch detects if there is down node and skip
the flush if so, since the flush will fail after the timeout in this
case anyway.
The slowness due to the flush timeout in
compaction_test.py::TestCompaction::test_delete_tombstone_gc_node_down
is fixed with this patch.
Fixes#22413Closesscylladb/scylladb#22445
Adds an optional callback to "listen", returning the shard local object
instance. If provided, instead of creating a "full" reloadable cerificate
object, only do so on shard 0, and use callback to reload other shards
"manually".
In the spirit of using standard-library types, instead of boost ones
where possible.
Although a disk type, it is serialized/deserialized with custom code, so
the change shouldn't cause any changes in the disk representation.
It is almost always a bad idea to run removenode force. This means a
node is removed without the remaining nodes to stream data that they
should own after the removal. This will make the cluster into a worse
state than a node being down.
One can use one of the following procedure instead:
1) Fix the dead node and move it back to the cluster
2) Run replace ops to replace the dead node
3) Run removenode ops again
We have seen misuse of nodetool removenode force by users again and
again. This patch rejects it so it can not be misused anymore.
Fixesscylladb/scylladb#15833Closesscylladb/scylladb#15834
This commit removes the information about FIPS out of the '.. only:: enterprise' directive.
As a result, the information will now show in the doc in the ScyllaDB repo
(previously, the directive included the note in the Entrprise docs only).
Refs https://github.com/scylladb/scylla-enterprise/issues/5020Closesscylladb/scylladb#22374
Fixes#21993
Removes configuration_encryptor mention from docs.
The tool itself (java) is not included in the main branch
java tools, thus need not remove from there. Only the words.
Closesscylladb/scylladb#22427
Add a test to reproduce a bug in the read DMA API of
`encrypted_file_impl` (the file implementation for Encryption-at-Rest).
The test creates an encrypted file that contains padding, and then
attempts to read from an offset within the padding area. Although this
offset is invalid on the decrypted file, the `encrypted_file_impl` makes
no checks and proceeds with the decryption of padding data, which
eventually leads to bogus results.
Refs #22236.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 8f936b2cbc)
Fixes#22236
If reading a file and not stopping on block bounds returned by `size()`, we could
allow reading from (_file_size+1-15) (block boundary) and try to decrypt this
buffer (last one).
Check on last block in `transform` would wrap around size due to us being >=
file size (l).
Simplest example:
Actual data size: 4095
Physical file size: 4095 + key block size (typically 16)
Read from 4096: -> 15 bytes (padding) -> transform return _file_size - read offset
-> wraparound -> rather larger number than we expected
(not to mention the data in question is junk/zero).
Just do an early bounds check and return zero if we're past the actual data limit.
v2:
* Moved check to a min expression instead
* Added lengthy comment
* Added unit test
v3:
* Fixed read_dma_bulk handling of short, unaligned read
* Added test for unaligned read
v4:
* Added another unaligned test case
`tablet_storage_group_manager::all_storage_groups_split()` calls `set_split_mode()` for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using `std::ranges::all_of()` which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurrence of the predicate (`set_split_mode()`) returning false. `set_split_mode()` creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups.
The missing split compaction groups are later created in `tablet_storage_group_manager::split_all_storage_groups()` which also calls `set_split_mode()`, and that is the reason why split completes successfully. The problem is that
`tablet_storage_group_manager::all_storage_groups_split()` runs under a group0 guard, but
`tablet_storage_group_manager::split_all_storage_groups()` does not. This can cause problems with operations which should exclude with compaction group creation. i.e. DROP TABLE/DROP KEYSPACE
Fixes#22431
This is a bugfix and should be back ported to versions with tablets: 6.1 6.2 and 2025.1
Closesscylladb/scylladb#22330
* github.com:scylladb/scylladb:
test: add reproducer and test for fix to split ready CG creation
table: run set_split_mode() on all storage groups during all_storage_groups_split()
This commit adds the OS support information for version 2025.1.
In addition, the OS support page is reorganized so that:
- The content is moved from the include page _common/os-support-info.rst
to the regular os-support.rst page. The include page was necessary
to document different support for OSS and Enterprise versions, so
we don't need it anymore.
- I skipped the entries for versions that won't be supported when 2025.1
is released: 6.1 and 2023.1.
- I moved the definition of "supported" to the end of the page for better
readability.
- I've renamed the index entry to "OS Support" to be shorter on the left menu.
Fixes https://github.com/scylladb/scylladb/issues/22474Closesscylladb/scylladb#22476
This series exposes a Clock template parameter for loading_cache so that the test could use
the manual_clock rather than the lowres_clock, since relying on the latter is flaky.
In addition, the test load function is simplified to sleep some small random time and co_return the expected string,
rather than reading it from a real file, since the latter's timing might also be flaky, and it out-of-scope for this test.
Fixes#20322
* The test was flaky forever, so backport is required for all live versions.
Closesscylladb/scylladb#22064
* github.com:scylladb/scylladb:
tests: loading_cache_test: use manual_clock
utils: loading_cache: make clock_type a template parameter
test: loading_cache_test: use function-scope loader
test: loading_cache_test: simlute loader using sleep
test: lib: eventually: add sleep function param
test: lib: eventually: make *EVENTUALLY_EQUAL inline functions
Most of the code from `recognized_options` is either incorrect or lacks
any implementation, for example:
- comments for Everywhere and Local strategies are contradictory, first
says to allow all options, second says that the strategy doesn't accept
any options, even though both functions have the same implementation,
- for Local & Everywhere strategies the same logic is repeated in
`validate_options` member functions, i.e. this function does nothing,
- for NetworkTopology this function returns DC names and tablet options, but tablet
options are empty; OTOH this strategy also accepts 'replication_factor'
tag, which was ommitted,
- for SimpleStrategy this function returns `replication_factor`, but this is also validated
in `validate_options` function called just before the removed
function.
All of it makes `validate_replication_strategy` work incorrectly.
That being said, 3 tests fail because of this logic's removal, so it did
something after all. The failing tests are commented out, so that the CI
passes, and will be restored in the next commit(s).
This change:
- Set liveness::LiveUpdate for audit_categories, audit_tables,
and audit_keyspaces
- Keep const reference to db::config in audit, so current config values
can be obtained by audit implementation
- Implement function audit::update_config to parse given string, update
audit datastructures when needed, and log the changes.
- Add observers to call audit::update_config when categories,
tables, or keyspaces configuration changes
Fixesscylladb/scylla-enterprise#1789
This change:
- Swap static function and audit constructors in audit.cc
This is a preparatory commit for enabling liveupdate of audit
categories, tables, and keyspaces. It allows future use of static
parsing functions in audit constructor.
This change:
- Move audit_info::category_string to a new static function
- Start using the new function in audit_info::category_string
This is a preparatory commit for enabling liveupdate of audit
categories, tables, and keyspaces. The newly created static function
will be required for proper logging of audit categories.
This change:
- Remove code that prevented audit from starting if audit_categories,
audit_tables, and audit_keyspaces are not configured
This is a preparatory commit for enabling liveupdate of audit
categories, tables, and keyspaces. Without this change, audit is
not started for particular categories/tables/keyspaces setting and
it is unwanted behavior if customer can change audit configuration via
liveupdate.
This commit has performance implications if audit sink is set (meaning
"audit"="table" or "audit"="syslog" in the config) but categories,
tables, and keyspaces are not set to audit anything. Before this commit,
audit was not started, so some operations (like creating audit_info or
lookup in empty collections) were omitted.
Currently, /task_manager/task_status_recursive/{task_id} and
/task_manager/task_status/{task_id} unregister queries task if it
has already finished.
The status should not disappear after being queried. Do not unregister
finished task when its status or recursive status is queried.
In the following patches, get_status won't be unregistering finished
tasks. However, tests need a functionality to drop a task, so that
they could manipulate only with the tasks for operations that were
invoked by these tests.
Add /task_manager/drain/{module} to unregister all finished tasks
from the module. Add respective nodetool command.
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.
Skip the keyspace repair if no_such_keyspace is thrown during preparations.
- To make Scylla able to run in FIPS-compliant system, add .hmac files for
crypto libraries on relocatable/rpm/deb packages.
- Currently we just write hmac value on *.hmac files, but there is new
.hmac file format something like this:
```
[global]
format-version = 1
[lib.xxx.so.yy]
path = /lib64/libxxx.so.yy
hmac = <hmac>
```
Seems like GnuTLS rejects fips selftest on .libgnutls.so.30.hmac when
file format is older one.
Since we need to absolute path on "path" directive, we need to generate
.libgnutls.so.30.hmac in older format on create-relocatable-script.py,
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closesscylladb/scylladb#22384
The vector is a fixed-length array of non-null
specified type elements.
Implement serialization, deserialization, comparison,
JSON and Lua support, and other functionalities.
Co-authored-by: Dawid Pawlik <501149991dp@gmail.com>
Since now topology does not contain ip addresses there is no need to
create topology on an ip address change. Only peers table has to be
updated, so call a function that does peers table update only.
Raft topology state application does two things: re-creates token metadata
and updates peers table if needed. The code for both task is intermixed
now. The patch separates it into separate functions. Will be needed in
the next patch.
There's such a reference on storage_service itself, it can use this->_sys_dist_ks instead thus making its API (both internal and external) a bit simpler.
Closesscylladb/scylladb#22483
* github.com:scylladb/scylladb:
storage_service: Drop sys_dist_ks argument from track_upgrade_progress_to_topology_coordinator()
storage_service: Drop sys_dist_ks argument from raft_state_monitor_fiber()
storage_service: Drop sys_dist_ks argument from join_topology()
storage_service: Drop sys_dist_ks argument from join_cluster()
these misspellings were identified by codespell. let's fix them.
one of them is a part of a user visble string.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22443
Specialize config_from_string() for sstring to resolve lexical_cast
stream state parsing limitation. This enables correct handling of empty
string configurations, such as setting an empty value in CQL:
```cql
UPDATE system.config SET value='' WHERE
name='allowed_repair_based_node_ops';
```
Previous implementation using boost::lexical_cast would fail due to
EOF stream state, incorrectly rejecting valid empty string conversions.
Fixesscylladb/scylladb#22491
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22492
Fixes a race condition where COMPRESSOR_NAME in zstd.cc could be
initialized before compressor::namespace_prefix due to undefined
global variable initialization order across translation units. This
was causing ZstdCompressor to be unregistered in release builds,
making it impossible to create tables with Zstd compression.
Replace the global namespace_prefix variable with a function that
returns the fully qualified compressor name. This ensures proper
initialization order and fixes the registration of the ZstdCompressor.
Fixesscylladb/scylladb#22444
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22451
* seastar a9bef537...18221366 (33):
> io_queue: fix static member access to comply with CWG2813
> build: add missing include in program_options.cc
> coroutine: move operator co_await(exception) into seastar::coroutine namespace
> fair_queue: Mark entry constructor explicit
> test: Add perf test to measure the "cost" of chain wakeup
> websocket: Support clients that do not specify subprotocol
> websocket: Accept plain const& to string as subprotocol
> perf_tests: Inline print_text_header() into stdout_printer
> perf_tests: Right-align numeric metrics in markdown tables
> scripts/addr2line.py: fix hanging with the new llvm-addr2line version
> Revert "rpc stream: do not abort stream queue if stream connection was closed without error"
> websocket: Convert connection::read_http_upgrade_request() to use coros
> rpc stream: do not abort stream queue if stream connection was closed without error
> linux-aio: remove cpu reduction suggestions
> gitignore: ignore directories that match "build*"
> perf_tests: make column generic
> net: replace deprecated ip::address_v4::from_string()
> file: remove deprecated file lifetime hint APIs
> semaphore: expiry_handler: tunnel exception_ptr to entry
> tests: unit: refactor expected_exception
> semaphore: return early exception before appending wait_list
> semaphore: expiry_handler: refactor exception getters
> abortable_fifo: support OnAbort callbacks accepting exception_ptr
> abort_on_expiry: fix typos in comments
> abort_on_expiry: request_abort with timed_out_error
> Add missing include in dpdk_rte.hh
> build: use path to libraries in .pc
> httpd: drop unnecessary dependencies from httpd.hh
> build: allow CMake to find Boost using package config
> print: remove deprecated print() functions
> github: s/ubuntu-latest/ubuntu-24.04/
> perf_tests: coroutinize main loop
> add perf_tests_perf
Closesscylladb/scylladb#22466
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.
The following patches are imported from the scylla enterprise:
*) Merge 'Introduce file stream for tablet' from Asias He
This patch uses Seastar RPC stream interface to stream sstable files on
network for tablet migration.
It streams sstables instead of mutation fragments. The file based
stream has multiple advantages over the mutation streaming.
- No serialization or deserialization for mutation fragments
- No need to read and process each mutation fragments
- On wire data is more compact and smaller
In the test below, a significant speed up is observed.
Two nodes, 1 shard per node, 1 initial_tablets:
- Start node 1
- Insert 10M rows of data with c-s
- Bootstrap node 2
Node 1 will migration data to node2 with the file stream.
Test results:
1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s
[shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds
2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s
[shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds
Test Summary:
File stream v.s. Mutation stream improvements
- Stream bandwidth = 836 / 128 (MB/s) = 6.53X
- Stream time = 23.60 / 1.08 (Seconds) = 21.85X
- Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X
Closes scylladb/scylla-enterprise#3438
* github.com:scylladb/scylla-enterprise:
tests: Add file_stream_test
streaming: Implement file stream for tablet
*) streaming: Use new take_storage_snapshot interface
The new take_storage_snapshot returns a file object instead of a file
name. This allows the file stream sender to read from the file even if
the file is deleted by compaction.
Closes scylladb/scylla-enterprise#3728
*) streaming: Protect unsupported file types for file stream
Currently, we assume the file streamed over the stream_blob rpc verb is
a sstable file. This patch rejects the unsupported file types on the
receiver side. This allows us to stream more file types later using the
current file stream infrastructure without worrying about old nodes
processing the new file types in the wrong way.
- The file_ops::noop is renamed to file_ops::stream_sstables to be
explicit about the file types
- A missing test_file_stream_error_injection is added to the idl
Fixes: #3846
Tests: test_unsupported_file_ops
Closesscylladb/scylla-enterprise#3847
*) idl: Add service::session_id id to idl
It will be used in the next patch.
Refs #3907
*) streaming: Protect file stream with topology_guard
Similar to "storage_service, tablets: Use session to guard tablet
streaming", this patch protects file stream with topology_guard.
Fixes#3907
*) streaming: Take service topology_guard under the try block
Taking the service::topology_guard could throw. Currently, it throws
outside the try block, so the rpc sink will not be closed, causing the
following assertion:
```
scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
seastar::rpc::sink_impl<netw::serializer,
streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
`this->_con->get()->sink_closed()' failed.
```
To fix, move more code including the topology_guard taking code to the
try block.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4106Closesscylladb/scylla-enterprise#4110
*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho
We're not preserving the SSTable state across file based migration, so
staging SSTables for example are being placed into main directory, and
consequently, we're mixing staging and non-staging data, losing the
ability to continue from where the old replica left off.
It's expected that the view update backlog is transferred from old
into new replica, as migration doesn't wait for leaving replica to
complete view update work (which can take long). Elasticity is preferred.
So this fix guarantees that the state of the SSTable will be preserved
by propagating it in form of subdirectory (each subdirectory is
statically mapped with a particular state).
The staging sstables aren't being registered into view update generator
yet, as that's supposed to be fixed in OSS (more details can be found
at https://github.com/scylladb/scylladb/issues/19149).
Fixes#4265.
Closesscylladb/scylla-enterprise#4267
* github.com:scylladb/scylla-enterprise:
tablet: Preserve original SSTable state with file based tablet migration
sstables: Add get method for sstable state
*) sstable: (Re-)add shareabled_components getter
*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund
Fixes#4246
Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.
Ensures we transfer and pre-process scylla metadata for streamed
file blobs first, then properly apply receiving nodes local config
by using a source and sink layer exported from sstables, which
handles things like ordering, metadata filtering (on source) as well
as handling metadata and proper IO paths when writing data on
receiver node (sink).
This implementation maintains the statelessness of the current
design, and the delegated sink side will re-read and re-write the
metadata for each component processed. This is a little wasteful,
but the meta is small, and it is less error prone than trying to do
caching cross-shards etc. The transport is isolated from the
knowledge.
This is an alternative/complement to #4436 and #4472, fixing the
underlying issue. Note that while the layers/API:s here allows easy
fixing of other fundamental problems in the feature (such as
destination location etc), these are not included in the PR, to keep
it as close to the current behaviour as possible.
Closesscylladb/scylla-enterprise#4646
* github.com:scylladb/scylla-enterprise:
raft_tests: Copy/add a topology test with encryption
file streaming: Use sstable source/sink to transfer snapshots
sstables: Add source and sink objects + producers for transfering a snapshot
sstable::types: Add remove accessor for extension info in metadata
*) The change for error injection in merge commit 966ea5955dd8760:
File streaming now has "stream_mutation_fragments" error injection points
so test_table_dropped_during_streaming works with file streaming.
*) doc: document file-based streaming
This commit adds a description of the file-based streaming feature to the documentation.
It will be displayed in the docs using the scylladb_include_flag directive after
https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
and, in turn, branch-2024.2.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
Refs https://github.com/scylladb/scylla-enterprise/issues/4254Closesscylladb/scylla-enterprise#4587
*) doc: move File-based streaming to the Tablets source file-based-streaming
This commit moves the description of file-based streaming from a common include file
to the regular doc source file where tablets are described.
Closesscylladb/scylla-enterprise#4652
*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference
Closesscylladb/scylladb#22467
Clang 18.1 with lto gained the ability to eliminate dead stores.
Since debug::the_database is write-only as far as the compiler understands
(it is read only by gdb), all writes to it are eliminated.
Protect writes to the variable by marking it volatile.
Closesscylladb/scylladb#22454
Our CI accidentally switched to using CMake to compile scylla and it looks like CMake doesn't run the `${mode}-headers` command correctly and some missing-include in headers managed to slip in.
Compile fix, no backport needed.
Closesscylladb/scylladb#22471
* github.com:scylladb/scylladb:
test/raft/replication.hh: add missing include <fmt/std.h>
test/boost/bptree_validation.hh: add missing include <fmt/format.h>
Relying on a real-time clock like lowres_clock
can be flaky (in particular in debug mode).
Use manual_clock instead to harden the test against
timing issues.
Fixes#20322
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So the unit test can use manual_clock rather than lowres_clock
which can be flaky (in particular in debug mode).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than a global function, accessing a thread-local `load_count`.
The thread-local load_count cannot be used when multiple test
cases run in parallel.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This test isn't about reading values from file,
but rather it's about the loading_cache.
Reading from the file can sometimes take longer than
the expected refresh times, causing flakiness (see #20322).
Rather than reading a string from a real file, just
sleep a random, short time, and co_return the string.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This reverts commit 32ab58cdea.
Now repair service starts before and stops after storage server, so the
problem described in the commit is no longer relevant.
The latter service uses repair, but not the vice-versa, so the correct
(de)initialization order should be the same.
refs: #2737
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently initialization order of repair and view-builder is not
correct, so there are several places in repair code that check for v.b.
to be initialized before doing anything. There's one more place that
needs that care -- the construction of row_level_repair object.
The class instantiates helper objects that reply on view_builder to be
fully initialized and is itself created by many other task types from
repair code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR contains the missing part of a fix for scylladb/scylla-enterprise#4912 which was omitted during migration of workload prioritization to the source available repository. Even though the regression test for it was ported, it was silently made ineffective by a different fix (scylladb/scylla-enterprise#4764), so this PR also improves the test.
Fixes: scylladb/scylladb#22404
No need to backport - service levels are not yet a part of any source-available release.
Closesscylladb/scylladb#22416
* github.com:scylladb/scylladb:
test/auth_cluster: make test_service_level_metric_name_change useful
main: rename `cql_sg_stats` metrics on scheduling group rename
rather then macros.
This is a first cleanup step before adding a sleep function
parameter to support also manual_clock.
Also, add a call to BOOST_REQUIRE_EQUAL/BOOST_CHECK_EQUAL,
respectively, to make an error more visible in the test log
since those entry points print the offending values
when not equal.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
request handler
Before this change, the logic of truncate for tablets was triggered from
topology_coordinator::handle_global_request(). This was done without
using a topology transition state which remained empty throughout the
truncate handler's execution.
This change moves the truncate logic to a new method
topology_coordinator::handle_truncate_table(). This method is now called
as a handler of the truncate_table topology transition state instead of
a handler of the trunacate_table global topology request.
Truncate table for tablets is implemented as a global topology operation.
However, it does not have a transition state associated with it, and
performs the truncate logic in handle_global_request() while
topology::tstate remains empty. This creates problems because
topology::is_busy() uses transition_state to determine if the topology
state machine is busy, and will return false even though a truncate
operation is ongoing.
This change adds a new transition state: truncate_table
When the table is stopped, all compaction groups
are stopped, and as part of that, they are flushing
their memtables.
To synchronize with stop-induced flush operation,
move _pending_flushes_phaser.stop() later in table::stop(),
after all compaction groups are flushed and stopped.
This way, in table::flush, if we see that the phaser
is already closed, we know that there is nothing to flush,
otherwise we start a flush operation that would be waited
on by a parallel table::stop().
Fixes#22243
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#22339
Test the simple case of base/view pairing with replication_factor
that is a multiple of the number of racks.
As well as the complex case when simple_tablets_rack_aware_view_pairing
is not possible.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Enabled with the tablets_rack_aware_view_pairing cluster feature
rack-aware pairing pairs base to view replicas that are in the
same dc and rack, using their ordinality in the replica map
We distinguish between 2 cases:
- Simple rack-aware pairing: when the replication factor in the dc
is a multiple of the number of racks and the minimum number of nodes
per rack in the dc is greater than or equal to rf / nr_racks.
In this case (that includes the single rack case), all racks would
have the same number of replicas, so we first filter all replicas
by dc and rack, retaining their ordinality in the process, and
finally, we pair between the base replicas and view replicas,
that are in the same rack, using their original order in the
tablet-map replica set.
For example, nr_racks=2, rf=4:
base_replicas = { N00, N01, N10, N11 }
view_replicas = { N11, N12, N01, N02 }
pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 }
Note that we don't optimize for self-pairing if it breaks pairing ordinality.
- Complex rack-aware pairing: when the replication factor is not
a multiple of nr_racks. In this case, we attempt best-match
pairing in all racks, using the minimum number of base or view replicas
in each rack (given their global ordinality), while pairing all the other
replicas, across racks, sorted by their ordinality.
For example, nr_racks=4, rf=3:
base_replicas = { N00, N10, N20 }
view_replicas = { N11, N21, N31 }
pairing would be: { N00, N31 }*, { N10, N11 }, { N20, N21 }
* cross-rack pair
If we'd simply stable-sort both base and view replicas by rack,
we might end up with much worse pairing across racks:
{ N00, N11 }*, { N10, N21 }*, { N20, N31 }*
* cross-rack pair
Fixesscylladb/scylladb#17147
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, when reducing RF, we may drop replicas from
the view before dropping replicas from the base table.
Since get_view_natural_endpoint is allowed to return
a disengaged optional if it can't find a pair for the
base replica, replcace the exiting assertion with code
handling this case, and count those events in a new
table metric: total_view_updates_failed_pairing.
Note that this does not fix the root cause for the issue
which is the unsynchronized dropping of replicas, that
should be atomic, using a single group0 transaction.
Refs scylladb/scylladb#21492
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than tracking only the replica host_id, keep
track of the locator:::node& to prepare for
rack-aware pairing.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Like get_location by inet_address, there is a case, when
a node is replaced that the node cannot be found by host_id.
Currently get_location would return a reference based on the nullptr
which might cause a segfault as seen in testing. Instead,
if the host_id is of the location, revert to calling
get_location() which consults this_node or _cfg.local_dc_rack.
Otherwise, throw a runtime_error.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Simplify the function logic by calculating the predicate
function once, before scanning all base and view replicas,
rather than testing the different options in the inner loop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"self-pairing" is enabled only when use_legacy_self_pairing
is enabled. That is currently unclear in the documentation
comment for this function.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently we always lookup both `my_address` and *target_endpoint
in remote_endpoints. But if my_address is in remote_endpoints
in some cases the second lookup is not needed, so do it only
to decide whether to swap target_endpoint with my_address, if
found in remote_endpoints, or to remove that match, if
*target_endpoint is already pending as well.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although at the moment storage_service::replicate_to_all_cores
may yield between updating the base and view tables with
a new effective_replication_map, scylladb/scylladb#21781
was submitted to change that so that they are updated
atomically together.
This change prepares for the above change, and is harmless
at the moment.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All view live in the same keyspace as their base
table, so calculate the keyspace-dependent flags
once, outside the per-view update loop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously, during backup, SSTable components are preserved in the
snapshot directory even after being uploaded. This leads to redundant
uploads in case of failed backups or restarts, wasting time and
resources (S3 API calls).
This change
- adds an optional query parameter named "move_files" to
"/storage_service/backup" API. if it is set to "true", SSTable
components are removed once they are backed up to object storage.
- conditionally removes SSTable components from the snapshot directory once
they are successfully uploaded to the target location. This prevents
re-uploading the same files and reduces disk usage.
This change only "Refs" #20655, because, we can move further optimize
the backup process, consider:
- Sending HEAD requests to S3 to check for existing files before uploading.
- Implementing support for resuming partially uploaded files.
Fixes#21799
Refs #20655
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Extract upload_component() from backup_task_impl::do_backup() to improve
readability and prepare for optional post-upload cleanup. This refactoring
simplifies the main backup flow by isolating the upload logic into its own
function.
The change is motivated by an upcoming feature that will allow optional
deletion of components after successful upload, which would otherwise add
complexity to do_backup().
Refs scylladb/scylladb#21799
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of implementing `snapshot_ctl::run_snapshot_modify_operation()` as
a template function, let it accept as plain noncopyable_function
instance, and return `future<>`.
Previously, `snapshot_ctl::run_snapshot_modify_operation` was a template
function that accepted a templated functor parameter. This approach
limited its usability because callers needed to be defined in the same
translation unit as the template implementation.
however, `backup_task_impl` is defined in another translation unit,
and we intend to call `snapshot_ctl::run_snapshot_modify_operation()`
in its implementation. so in order to cater this need, there are two
options:
1. to move the definition of the template function into the header file.
but the downside is that this slows down the compilation by increaing
the size of header.
2. to change the template function to a regular function. This change
restricts the function's parameter to a specific signature. However, all
current callers already return a `future<>` object, so there's minimal
impact.
in this change, we implement the second option. this allows us to call
this function from another translation unit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This adds a reproducer for #22431
In cases where a tablet storage group manager had more than one storage
group, it was possible to create compaction groups outside the group0
guard, which could create problems with operations which should exclude
with compaction group creation.
tablet_storage_group_manager::all_storage_groups_split() calls set_split_mode()
for each of its storage groups to create split ready compaction groups. It does
this by iterating through storage groups using std::ranges::all_of() which is
not guaranteed to iterate through the entire range, and will stop iterating on
the first occurance of the predicate (set_split_mode()) returning false.
set_split_mode() creates the split compaction groups and returns false if the
storage group's main compaction group or merging groups are not empty. This
means that in cases where the tablet storage group manager has non-empty
storage groups, we could have a situation where split compaction groups are not
created for all storage groups.
The missing split compaction groups are later created in
tablet_storage_group_manager::split_all_storage_groups() which also calls
set_split_mode(), and that is the reason why split completes successfully. The
problem is that tablet_storage_group_manager::all_storage_groups_split() runs
under a group0 guard, and tablet_storage_group_manager::split_all_storage_groups()
does not. This can cause problems with operations which should exclude with
compaction group creation. i.e. DROP TABLE/DROP KEYSPACE
In order to reduce the dependency on external libraries, and for better integration with ranges in C++ standard library. let's use the homebrew `utils::views::unique()` before unique is accepted by the C++ standard.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#22393
* github.com:scylladb/scylladb:
cql3, test: switch from boost::adaptors::uniqued to utils::views:unique
utils: implement drop-in replacement for replacing boost::adaptors::uniqued
The log file names created in `scylla_cluster.py` by
`ScyllaClusterManager`
and files to be collected in conftest.py by `manager` should be in
sync. This patch fixes the issue, originally introduced in
scylladb/scylladb#22192Fixesscylladb/scylladb#22387
Backports: 6.1 and 6.2.
Closesscylladb/scylladb#22415
In order to reduce the dependency on external libraries, and for better
integration with ranges in C++ standard library. let's use the homebrew
`utils::views::unique()` before unique is accepted by the C++ standard.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add a custom implementation of boost::adaptors::uniqued that is compatible
with C++20 ranges library. This bridges the gap between Boost.Range and
the C++ standard library ranges until std::views::unique becomes available
in C++26. Currently, the unique view is included in
[P2214](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2760r0.html)
"A Plan for C++ Ranges Evolution", which targets C++26.
The implementation provides:
- A lazy view adaptor that presents unique consecutive elements
- No modification of source range
- Compatibility with C++20 range views and concepts
- Lighter header dependencies compared to Boost
This resolves compilation errors when piping C++20 range views to
boost::adaptors::uniqued, which fails due to concept requirements
mismatch. For example:
```c++
auto range = std::views::take(n) | boost::adaptors::uniqued; // fails
```
This change also offers us a lightweight solution in terms of smaller
header dependency.
While std::ranges::unique exists in C++23, it's an eager algorithm that
modifies the source range in-place, unlike boost::adaptors::uniqued which
is a lazy view. The proposed std::views::unique (P2214) targeting C++26
would provide this functionality, but is not yet available.
This implementation serves as an interim solution for filtering consecutive
duplicate elements using range views until std::views::unique is
standardized.
For more details on the differences between `std::ranges::unique` and
`boost::adaptors::uniqued`:
- boost::adaptors::uniqued is a view adaptor that creates a lazy view over the original range. It:
* Doesn't modify the source range
* Returns a view that presents unique consecutive elements
* Is non-destructive and lazy-evaluated
* Can be composed with other views
- std::ranges::unique is an algorithm that:
* Modifies the source range in-place
* Removes consecutive duplicates by shifting elements
* Returns an iterator to the new logical end
* Cannot be used as a view or composed with other range adaptors
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated due to all `schema_ptr`s temporarily dying.
To fix this, this patch adds the base schema to the registry, alongside
the view schema. We store just the frozen base schema, so that we can
transfer it across shards. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.
In this series we also make sure that the view schemas in the registry are
kept up-to-date in regards to base schema changes.
Fixes https://github.com/scylladb/scylladb/issues/21354
This issue is a bug, so adding backport labels 6.1 and 6.2
Closesscylladb/scylladb#21862
* github.com:scylladb/scylladb:
test: add test for schema registry maintaining base info for views
schema_registry: avoid setting base info when getting the schema from registry
schema_registry: update cached base schemas when updating a view
schema_registry: cache base schemas for views
db: set base info before adding schema to registry
As discussed in
https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813,
compact storage tables are deprecated.
Yet, there's is nothing in the code that prevents users
from creating such tables.
This patch adds a live-updateable config option:
`enable_create_table_with_compact_storage`, set to
`false` by default, that require users to opt-in
in order to create new tables WITH COMPACT STORAGE.
Refs scylladb/scylladb#12263, scylladb/scylladb#16375
* Since this guardrail is an enhancement, no backport is needed
Closesscylladb/scylladb#16403
* github.com:scylladb/scylladb:
docs: ddl: document the deprecation of compact tables
test: enable_create_table_with_compact_storage for tests that need it
config: add enable_create_table_with_compact_storage
The test test_service_level_metric_name_change was originally introduced
to serve as a regression test for scylladb/scylla-enterprise#4912.
Before the fix, some per-scheduling-group metrics would not get adjusted
when the scheduling group gets renamed (which does happen for SL-managed
scheduling groups) and it would be possible to attempt to register
metrics with the same set of labels, resulting in an error.
However, in scylladb/scylla-enterprise#4764, another bug was fixed which
affected the test. Before a service level is created, a "test"
scheduling group can be created by service level controller if it is
unsure whether it is allowed to create more scheduling groups or not. If
creation of the scheduling group succeeds, it is put into the pool of
scheduling groups to be reused when a new service level is created.
Therefore, the node handling CREATE SERVICE LEVEL would always use the
scheduling group that was originally created for the sake of the test as
a SG for the new service level.
All of the above is intentional and was actually fixed by the
aforementioned issue. However, the test scheduling groups would always
get unique names and, therefore, the error would no longer reproduce.
However, the faulty logic that ran previously and caused the bug still
runs - when a node updates its service levels cache on group0 reload.
The test previously used only one node. Fix it by starting two nodes
instead of one at the beginning of the test and by serving all service
level commands to the first node - were the issue not fixed, the error
would get triggered on the second node.
This commit contains the part of a fix
for scylladb/scylla-enterprise#4912 that was accidentally omitted when
workload prioritization were ported from enterprise to scylladb.git
repo. Without it, the metrics created by `cql_sg_stats` would not be
updated, leading to wrong scheduling group names being used in metrics'
names, and could lead to "double metric registration errors" in some
unlucky circumstances where a scheduling group would be created,
destroyed and then created again.
Fixes: scylladb/scylladb#22404
The repair_time in system.tablets will be updated when repair runs
successfully. We can now use it to update the repair time for tombstone
gc, i.e, when the system.tablets.repair_time is propagated, call
gc_state.update_repair_time() on the node that is the owner of the
tablet.
Since b3b3e880d3 ("repair: Reduce hints and batchlog flush"), the
repair time that could be used for tombstone gc might be smaller than
when the repair is started, so the actual repair time for tombstone gc
is returned by the repair rpc call from the repair master node.
Fixes#17507
New feature. No backport is needed.
Closesscylladb/scylladb#21896
* github.com:scylladb/scylladb:
repair: Stop using rpc to update repair time for repairs scheduled by scheduler
repair: Wire repair_time in system.tablets for tombstone gc
test: Disable flush_cache_time for two tablet repair tests
test: Introduce guarantee_repair_time_next_second helper
repair: Return repair time for repair_service::repair_tablet
service: Add tablet_operation.hh
Update configure.py to use wasm32-wasip1 as an alternative to wasm32-wasi,
matching the behavior previously implemented for CMake builds in 8d7786cb0e.
This ensures consistent WASI target handling across both build systems.
Refs #20878
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22386
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.
The following patches are imported from the scylla enterprise:
*) Merge 'Introduce file stream for tablet' from Asias He
This patch uses Seastar RPC stream interface to stream sstable files on
network for tablet migration.
It streams sstables instead of mutation fragments. The file based
stream has multiple advantages over the mutation streaming.
- No serialization or deserialization for mutation fragments
- No need to read and process each mutation fragments
- On wire data is more compact and smaller
In the test below, a significant speed up is observed.
Two nodes, 1 shard per node, 1 initial_tablets:
- Start node 1
- Insert 10M rows of data with c-s
- Bootstrap node 2
Node 1 will migration data to node2 with the file stream.
Test results:
1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s
[shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds
2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s
[shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds
Test Summary:
File stream v.s. Mutation stream improvements
- Stream bandwidth = 836 / 128 (MB/s) = 6.53X
- Stream time = 23.60 / 1.08 (Seconds) = 21.85X
- Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X
Closes scylladb/scylla-enterprise#3438
* github.com:scylladb/scylla-enterprise:
tests: Add file_stream_test
streaming: Implement file stream for tablet
*) streaming: Use new take_storage_snapshot interface
The new take_storage_snapshot returns a file object instead of a file
name. This allows the file stream sender to read from the file even if
the file is deleted by compaction.
Closes scylladb/scylla-enterprise#3728
*) streaming: Protect unsupported file types for file stream
Currently, we assume the file streamed over the stream_blob rpc verb is
a sstable file. This patch rejects the unsupported file types on the
receiver side. This allows us to stream more file types later using the
current file stream infrastructure without worrying about old nodes
processing the new file types in the wrong way.
- The file_ops::noop is renamed to file_ops::stream_sstables to be
explicit about the file types
- A missing test_file_stream_error_injection is added to the idl
Fixes: #3846
Tests: test_unsupported_file_ops
Closesscylladb/scylla-enterprise#3847
*) idl: Add service::session_id id to idl
It will be used in the next patch.
Refs #3907
*) streaming: Protect file stream with topology_guard
Similar to "storage_service, tablets: Use session to guard tablet
streaming", this patch protects file stream with topology_guard.
Fixes#3907
*) streaming: Take service topology_guard under the try block
Taking the service::topology_guard could throw. Currently, it throws
outside the try block, so the rpc sink will not be closed, causing the
following assertion:
```
scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
seastar::rpc::sink_impl<netw::serializer,
streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
`this->_con->get()->sink_closed()' failed.
```
To fix, move more code including the topology_guard taking code to the
try block.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4106Closesscylladb/scylla-enterprise#4110
*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho
We're not preserving the SSTable state across file based migration, so
staging SSTables for example are being placed into main directory, and
consequently, we're mixing staging and non-staging data, losing the
ability to continue from where the old replica left off.
It's expected that the view update backlog is transferred from old
into new replica, as migration doesn't wait for leaving replica to
complete view update work (which can take long). Elasticity is preferred.
So this fix guarantees that the state of the SSTable will be preserved
by propagating it in form of subdirectory (each subdirectory is
statically mapped with a particular state).
The staging sstables aren't being registered into view update generator
yet, as that's supposed to be fixed in OSS (more details can be found
at https://github.com/scylladb/scylladb/issues/19149).
Fixes#4265.
Closesscylladb/scylla-enterprise#4267
* github.com:scylladb/scylla-enterprise:
tablet: Preserve original SSTable state with file based tablet migration
sstables: Add get method for sstable state
*) sstable: (Re-)add shareabled_components getter
*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund
Fixes#4246
Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.
Ensures we transfer and pre-process scylla metadata for streamed
file blobs first, then properly apply receiving nodes local config
by using a source and sink layer exported from sstables, which
handles things like ordering, metadata filtering (on source) as well
as handling metadata and proper IO paths when writing data on
receiver node (sink).
This implementation maintains the statelessness of the current
design, and the delegated sink side will re-read and re-write the
metadata for each component processed. This is a little wasteful,
but the meta is small, and it is less error prone than trying to do
caching cross-shards etc. The transport is isolated from the
knowledge.
This is an alternative/complement to #4436 and #4472, fixing the
underlying issue. Note that while the layers/API:s here allows easy
fixing of other fundamental problems in the feature (such as
destination location etc), these are not included in the PR, to keep
it as close to the current behaviour as possible.
Closesscylladb/scylla-enterprise#4646
* github.com:scylladb/scylla-enterprise:
raft_tests: Copy/add a topology test with encryption
file streaming: Use sstable source/sink to transfer snapshots
sstables: Add source and sink objects + producers for transfering a snapshot
sstable::types: Add remove accessor for extension info in metadata
*) The change for error injection in merge commit 966ea5955dd8760:
File streaming now has "stream_mutation_fragments" error injection points
so test_table_dropped_during_streaming works with file streaming.
*) doc: document file-based streaming
This commit adds a description of the file-based streaming feature to the documentation.
It will be displayed in the docs using the scylladb_include_flag directive after
https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
and, in turn, branch-2024.2.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
Refs https://github.com/scylladb/scylla-enterprise/issues/4254Closesscylladb/scylla-enterprise#4587
*) doc: move File-based streaming to the Tablets source file-based-streaming
This commit moves the description of file-based streaming from a common include file
to the regular doc source file where tablets are described.
Closesscylladb/scylla-enterprise#4652
*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference
Closesscylladb/scylladb#22034
The content of the header file noexcept_traits.hh is unused throughout ScyllaDB's code base.
As part of a greater effort to cleanup Scylla's code and reduce content in the root directory, this header file is simply removed.
This is code cleanup - no need to backport.
Fixes: https://github.com/scylladb/scylladb/issues/22117
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#22139
The way that the "test/cqlpy/run --release" feature runs older Scylla
releases is that it takes *today*'s command line parameters and "fixes"
it to conform to what old releases took. This approach was easy to
implement (and the resulting "--release" feature is super useful), but
the downside is that we need to update this fixup code whenever we add
new options to the Scylla command line used by test/cqlpy/run.py.
Commit d04f376 made test/cqlpy/run.py use a new option
"--experimental-features=views-with-tablets", so now we need to remove
it when running older versions of Scylla. So this is what we do in this
patch.
Fixes#22349
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22350
The small cqlpy test in this patch is a regression test for issue #14390,
which claimed that the Scylla-only "tombstone_gc" option is missing from
the output of "describe table".
This test shows that this report is *not* true, at least not when the
"server-side describe" is used. "test/cqlpy/run --release ..." shows
that this test passes on master and also for Scylla versions all the
way back to Scylla 5.2 (Scylla 5.1 did not support server-side
describe, so the test fails for that reason).
This suggests that the report in issue #14390 was for old-style
client-side (cqlsh) describe, which we no longer support, so this
issue can be closed.
Fixes#14390.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22354
in this changeset, some misspellings identified by codespell were corrected.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#22301
* github.com:scylladb/scylladb:
ent/encryption: rename "sie" to "get_opt"
ent,main: fix misspellings
Add a paragraph documenting the decision to deprecate
the COMPACT STORAGE feature, and instruct the user
how to enable the feature despite that.
Note that we don't have an official migration strategy
for users like `DROP COMPACT STORAGE`, which is not
implemented at this time (See #3882).
Fixes#16375
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As discussed in
https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813,
compact storage tables are deprecated.
Yet, there's is nothing in the code that prevents users
from creating such tables.
This patch adds a live-updateable config option:
`enable_create_table_with_compact_storage` that require users
to opt-in in order to create new tables WITH COMPACT STORAGE.
The option is currently set to `true` by default in db/config
to reduce the churn to tests and to `false` in scylla.yaml,
for new clusters.
TODO: once regressions tests that use compact storage
are converted to enable the option, change the default in
db/config to false.
A unit test was added to test/cql-pytest that
checks that the respective cql query fails as expected
with the default option or when it is explicitly set to `false`,
and that the query succeeds when the option is set to `true`.
Note that `check_restricted_table_properties` already
returns an optional warning, but it is only logged
but not returned in the `prepared_statement`.
Fixing that is out of the scope of this patch.
See https://github.com/scylladb/scylladb/issues/20945
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Replace remaining uses of boost::adaptors::transformed with std::views::transform
to reduce Boost dependencies, following the migration pattern established in
bab12e3a. This change addresses recently merged code that reintroduced Boost
header dependencies through boost::adaptors::transformed usage.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22365
In this change, tablet_virtual_task starts supporting tablet
resize (i.e. split and merge).
Users can see running resize tasks - finished tasks are not
presented with the task manager API.
A new task state "suspended" is added. If a resize was revoked,
it will appear to users as suspended. We assume that the resize was revoked
when the tablet number didn't change.
Fixes: #21366.
Fixes: #21367.
No backport, new feature
Closesscylladb/scylladb#21891
* github.com:scylladb/scylladb:
test: boost: check resize_task_info in tablet_test.cc
test: add tests to check revoked resize virtual tasks
test: add tests to check the list of resize virtual tasks
test: add tests to check spilt and merge virtual tasks status
test: test_tablet_tasks: generalize functions
replica: service: add split virtual task's children
replica: service: pass parent info down to storage_group::split
tasks: children of virtual tasks aren't internal by default
tasks: initialize shard in task_info ctor
service: extend tablet_virtual_task::abort
service: retrun status_helper struct from tablet_virtual_task::get_status_helper
service: extend tablet_virtual_task::wait
tasks: add suspended task state
service: extend tablet_virtual_task::get_status
service: extend tablet_virtual_task::contains
service: extend tablet_virtual_task::get_stats
service: add service::task_manager_module::get_nodes
tasks: add task_manager::get_nodes
tasks: drop noexcept from module::get_nodes
replica: service: add resize_task_info static column to system.tablets
locator: extend tablet_task_info to cover resize tasks
Introduces a comprehensive audit system to track database operations for security
and compliance purposes. This change includes:
Core Components:
- New audit subsystem for logging database operations
- Service level integration for proper resource management
- CQL statement tracking with operation categories
- Login process integration for tenant management
Key Features:
- Configurable audit logging (syslog/table)
- Operation categorization (QUERY/DML/DDL/DCL/AUTH/ADMIN)
- Selective auditing by keyspace/table
- Password sanitization in audit logs
- Service level shares support (1-1000) for workload prioritization
- Proper lifecycle management and cleanup
I ran the dtests for audit (manually enabled) and they pass.
The in-repo tests pass.
Notably, there should be no non-whitespace changes between this and scylla-enterprise
Fixesscylladb/scylla-enterprise#4999Closesscylladb/scylladb#22147
* github.com:scylladb/scylladb:
audit: Add shares support to service level management
audit: Add service level support to CQL login process
audit: Add support to CQL statements
audit: Integrate audit subsystem into Scylla main process
audit: Add documentation for the audit subsystem
audit: Add the audit subsystem
Now that all topology related code uses host ids there is not point to
maintain ip to id (and back) mappings in the token metadata. After the
patch the mapping will be maintained in the gossiper only. The rest of
the system will use host ids and in rare cases where translation is
needed (mostly for UX compatibility reasons) the translation will be
done using gossiper.
Fixes: scylladb/scylla#21777
* 'gleb/drop-ip-from-tm-v3' of github.com:scylladb/scylla-dev: (57 commits)
hint manager: do not translate ip to id in case hint manager is stopped already
locator: token_metadata: drop update_host_id() function that does nothing now
locator: topology: drop indexing by ips
repair: drop unneeded code
storage_service: use host_id to look for a node in on_alive handler
storage_proxy: translate ips to ids in forward array using gossiper
locator: topology: remove unused functions
storage_service: check for outdated ip in on_change notification in the peers table
storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology
topology coordinator: change connection dropping code to work on host ids
cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query
locator: drop unused function from tablet_effective_replication_map
api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead
locator: token_metadata: remove unused ip based functions
locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs
gossiper: drop get_unreachable_token_owners functions
storage_service: use gossiper to map ip to id in node_ops operations
storage_service: fix indentation after the last patch
storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node
token_metadata: drop no longer used functions
...
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22201
The methods to resolve a key/token/range to a table are all noexcept.
Yet the method below all of these, `storage_group_for_id()` can throw.
This means that if due to any mistake a tablet without local replica is
attempted to be looked up, it will result in a crash, as the exception
bubbles up into the noexcept methods.
There is no value in pretending that looking up the tablet replica is
noexcept, remove the noexcept specifiers so that any bad lookup only
fails the operation at hand and doesn't crash the node. This is
especially relevant to replace, which still has a window where writes
can arrive for tablets that don't (yet) have a local replica. Currently,
this results in a crash. After this patch, this will only fail the
writes and the replace can move on.
Fixes: #21480Closesscylladb/scylladb#22251
The API /storage_service/truncate/{ks} returns an unimplemented
error when invoked. As we already have a CQL command,
`TRUNCATE TABLE ks.cf` that causes the table to be truncated on all
nodes, the API can be dropped. Due to the error, it is unused.
Fixes https://github.com/scylladb/scylladb/issues/10520
No backport is required. A small cleanup of not working API.
Closesscylladb/scylladb#22258
The sstable loader relied on the generation id to provide an efficient
hint about the shard that owns an sstable. But, this hint was rendered
ineffective with the introduction of UUID generation, as the shard id
was no longer embedded in the generation id. This also became suboptimal
with the introduction of tablets. Commit 0c77f77 addressed this issue by
reading the minimum from disk to determine sstable ownership but this
improvement was lost with commit 63f1969, which optimistically assumed
that hints would work most of the time, which isn't true.
This commit restores that change - shard id of a table is deduced by
reading minially from disk and then the sstable is fully loaded only if
it belongs to the local shard. This patch also adds a testcase to verify
that the sstable are loaded only in their respective shards.
Fixes#21015
This fixes a regression and should be backported.
Closesscylladb/scylladb#22263
* github.com:scylladb/scylladb:
sstable_directory: do not load remote sstables in process_descriptor
sstable_directory: update `load_sstable()` definition
sstable_directory: reintroduce `get_shards_for_this_sstable()`
If a tablet repair is scheduled by tablet repair scheduler, the repair
time for tombstone gc will be updated when the system.tablet.repair_time
is updated. Skip updating using rpc calls in this case.
The repair_time in system.tablets will be updated when repair runs
successfully. We can now use it to update the repair time for tombstone
gc, i.e, when the system.tablets.repair_time is propagated, call
gc_state.update_repair_time() on the node that is the owner of the
tablet.
Since b3b3e880d3 ("repair: Reduce hints and batchlog flush"), the
repair time that could be used for tombstone gc might be smaller than
when the repair is started, so the actual repair time for tombstone gc
is returned by the repair rpc call from the repair master node.
Fixes#17507
The cache of the hints and batchlog flush makes the exact repair time
check difficult in the test. Disabling it for two repair tests
that check the exact repair time.
The repair time returned by repair_service::repair_tablet considers the
hints and batchlog flush time, so it could be used for the tombstone gc
purpose.
Declaring-but-not-defining a fully specialized template is a great way to
cut dependencies between users and providers, but unfortunately not
supported for variable templates. Clang 18 does support it, but
apparently it is a misinterpretation of the standard, and was removed
in clang 19.
We started using this non-feature in 7ed89266b3.
The fix is to use function templates. This is more verbose as
each specialization needs to define a static variable to return,
but is fully supported.
Closesscylladb/scylladb#22299
The commit b39ca29b3c introduced detection of admission-waiter
anomaly and dumps permit diagnostics as soon as the semaphore did
not admit readers even though it could.
Later on, the commit bf3d0b3543 introduces the optimization where
the admission check is moved to the fiber processing the _read_list.
Since the semaphore no longer admits readers as soon as it can,
dumping diagnostic errors is not necessary as the situation is not
abnormal.
Closesscylladb/scylladb#22344
For several years now, we have seen a strange, and very rare, flakiness
in Alternator tests described in issue #17564: We see all the test pass,
pytest declares them to have passed, and while Python is existing, it
crashes with a signal 11 (SIGSEGV). Because this happens exclusively in
test/alternator and never in the test/cqlpy, we suspect that something
that the test/alternator leaves behind but test/cqlpy does not, causes
some race and crashes during shutdown.
The immediate suspect is the boto3 library, or rather, the urllib3 library
which it uses. This is more-or-less the only thing that test/alternator
does which test/cqlpy doesn't. The urllib3 library keeps around pools of
reusable connections, and it's possible (although I don't actually have any
proof for it) that these open connections may cause a crash during shutdown.
So in this patch I add to the "dynamodb" and "dynamodbstreams" fixtures
(which all Alternator tests use to connect to the server), a teardown which
calls close() for the boto3 client object. This close() call percolates
down to calling clear() on urllib3's PoolManager. Hopefully, this will
make some difference in the chance to crash during shutdown - and if it
doesn't, it won't hurt.
Refs #17564Closesscylladb/scylladb#22341
The code checks that it does not run for an ip address that is no longer
in use (after ip address change). To check that we can use peers table
and see if the host id is mapped to the address. If yes, this is the
latest address for this host id otherwise this is an outdated entry.
It is used by truncate code only and even there it only check if the
returned set is not empty. Check for dead token owners in the truncation
code directly.
Replace operation is special though. In case of replacing with the same
IP the gossiper will not have the mapping, and node_ops RPC
unfortunately does not send host id of a replaced node. For replace we
consult peers table instead to find the old owner of the IP. A node that
is replacing (the coordinator of the replace) will not have it though,
but luckily it is not needed since it updates metadata during
join_topology() anyway. The only thing that is missing there is
add_replacing_endpoint() call which the patch adds.
host_id_or_endpoint is a helper class that hold either id or ip and
translate one into another on demand. Use gossiper to do a translation
there instead of token_metadata since we want to drop ip based APIs from
the later.
Currently the entry is removed only if ip is not used by any normal or
transitioning node. This is done to not remove a wrong entry that just
happen to use the same ip, but the same can be achieved by checking host
id in the entry.
We want to drop ips from token_metadata so move to use host id based
counterparts. Messaging service gets a function that maps from ips to id
when is starts listening.
Instead use gossiper and peers table to retrieve same information.
Token_metadata is created from the mix of those two anyway. The goal is
to drop ips from token_metadata entirely.
The functions are called from RESful API so has to return ips for backwards
compatibility, but internally we can use host ids as long as possible
and convert to ips just before returning. This also drops usage of ip
based erm function which we want to get rid of.
The function is called by RESful API so has to return ips for backwards
compatibility, but internally we can use host ids as long as possible
and convert to ips just before returning. This also drops usage of ip
based erm function which we want to get rid of.
locator/util.hh already has get_range_to_address_map which is exactly
like the one in the storage_service. So remove the later one and use the
former instead.
cmake doesn't set a `-ffile-prefix-map` for source files. Among other things,
this results in absolute paths in Scylla logs:
```
Jan 11 09:59:11.462214 longevity-tls-50gb-3d-master-db-node-2dcd4a4a-5 scylla[16339]: scylla: /jenkins/workspace/scylla-master/next/scylla/utils/refcounted.hh:23: utils::refcounted::~refcounted(): Assertion `_count == 0' failed.
```
And it results in absolute paths in gdb, which makes it a hassle to get gdb to display
source code during debugging. (A build-specific `substitute-path` has to be
configured for that).
There is a `-file-prefix-map` rule for `CMAKE_BINARY_DIR`,
but it's wrong.
Patch dbb056f4f7, which added it,
was misguided.
What we want is to strip the leading components of paths up to
the repository directory, both in __FILE__ macros and in debug info.
For example, we want to convert /home/michal/scylla/replica/table.cc to
replica/table.cc or ./replica/table.cc, both in Scylla logs and in gdb.
What the current rule does is it maps `/home/michal/scylla/build` to `.`,
which is wrong: it doesn't do anything about the paths outside of `build`,
which are the ones we actually care about.
This patch fixes the problem.
Closesscylladb/scylladb#22311
wasm32-wasi has been removed in Rust 1.84 (Jan 5th, 2025). if one
compiles the tree with Rust 1.84 or up, following build failure is
expected:
```
[2/305] Building WASM /home/kefu/dev/scylladb/build/wasm/return_input.wasm
FAILED: wasm/return_input.wasm /home/kefu/dev/scylladb/build/wasm/return_input.wasm
cd /home/kefu/dev/scylladb/test/resource/wasm/rust && /usr/bin/cargo build --target=wasm32-wasi --example=return_input --locked --manifest-path=Cargo.toml --target-dir=/home/kefu/dev/scylladb/build/test/resource/wasm/rust && wasm-opt /home/kefu/dev/scylladb/build/test/resource/wasm/rust/wasm32-wasi//debug/examples/return_input.wasm -Oz -o /home/kefu/dev/scylladb/build/wasm/return_input.wasm && wasm-strip /home/kefu/dev/scylladb/build/wasm/return_input.wasm
error: failed to run `rustc` to learn about target-specific information
Caused by:
process didn't exit successfully: `rustc - --crate-name ___ --print=file-names --target wasm32-wasi --crate-type bin --crate-type rlib --crate-type dylib --crate-type cdylib --crate-type staticlib --crate-type proc-macro --print=sysroot --print=split-debuginfo --print=crate-name --print=cfg` (exit status: 1)
--- stderr
error: Error loading target specification: Could not find specification for target "wasm32-wasi". Run `rustc --print target-list` for a list of built-in targets
```
in order to workaround this issue, let's check for supported target,
and use wasm32-wasip1 if wasm32-wasi is not listed as the supported
target.
Refs #20878
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22320
While this isn't strictly needed for anything, messaging_service
is supposed to clear its RPC connection objects on stop,
for debuggability reasons.
But a recent change in this area broke that.
std::bind creates copies of its arguments, so the `m.clear()`
statement in stop_client() only clears a copy of the vector of shared pointers,
instead of clearing the original vector. This patch fixes that.
Fixes#22245Closesscylladb/scylladb#22333
Currently, tests are reusing the cluster. This leads to the situation
when test passes and leaves the cluster broken, that the next tests will
try to clean up the Scylla working directory during starting the node.
Timeout for starting is set to two minutes by default and sometimes
cleaning the mess after several tests can take more time, so tests fails
during adding the node to the cluster. Current PR marks the cluster
dirty after the test, so no need to clean the Scylla working directory.
The disadvantage of this way is increasing the time for tests execution.
Observable increase is approximately one minutes for one repeat in dev
mode: 22 min 35s vs. 23 min 41s.
Closesscylladb/scylladb#22274
When a node is bootstrapped and joined a cluster as a non-voter and changes it's role to a voter, errors can occur while committing a new Raft record, for instance, if the Raft leader changes during this time. These errors are not critical and should not cause a node crash, as the action can be retried.
Fixesscylladb/scylladb#20814
Backport: This issue occurs frequently and disrupts the CI workflow to some extent. Backports are needed for versions 6.1 and 6.2.
Closesscylladb/scylladb#22253
* github.com:scylladb/scylladb:
raft: refactor `remove_from_raft_config` to use a timed `modify_config` call.
raft: Refactor functions using `modify_config` to use a common wrapper for retrying.
raft: Handle non-critical config update errors in when changing status to voter.
test: Add test to check that a node does not fail on unknown commit status error when starting up.
raft: Add run_op_with_retry in raft_group0.
One of its caller is in the RESTful API which gets ips from the user, so
we convert ips to ids inside the API handler using gossiper before
calling the function. We need to deprecate ip based API and move to host
id based.
We will add code that expects id to ip mapping to exist. If it does not
it is better to fail earlier during testing, so add a function that
calls internal error in case there is no mapping.
Introduces shares-based workload prioritization for service levels, allowing
fine-grained control over resource allocation between tenants. Key changes:
- Add shares option to service level configuration:
- Valid range: 1-1000 shares
- Default value: 1000 shares
- Enterprise-only feature gated by WORKLOAD_PRIORITIZATION feature flag
- Extend CQL interface:
- Add shares parameter to CREATE/ALTER SERVICE_LEVEL
- Add shares column to system_distributed.service_levels
- Add percentage calculation to LIST SERVICE_LEVELS
- Add shares to DESCRIBE EFFECTIVE SERVICE_LEVEL output
- Add validation:
- Enforce shares range (1-1000)
- Validate enterprise feature flag
- Handle unset/delete markers properly
- Update service level statements:
- Add shares validation to CREATE/ALTER operations
- Preserve shares through default value replacement
- Add proper decomposition for shares values in result sets
This change enables operators to control relative resource allocation between
tenants using proportional share scheduling, while maintaining backward
compatibility with existing service level configurations.
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.
Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.
Fixes#20638
The PR contains few additional small fixes in separate commits related to the view build status table.
It addresses flakiness issues in tests that use the view build status table to determine when view building is complete. The table may be in incorrect state due to these issues, having a row with status STARTED when it actually finished building the view, which will cause us to wait in `wait_for_view` until it timeouts.
For testing I used a test similar to `test_view_build_status_with_replace_node`, but it only creates the views and calls `wait_for_view`. Without these commits it failed in 4/1024 runs, and with the commits it passed 2048/2048.
backport to fix the bugs that affects previous versions and improve CI stability
Closesscylladb/scylladb#22307
* github.com:scylladb/scylladb:
view_builder: hold semaphore during entire startup
view_builder: pass view name by value to write_view_build_status
view_builder: write status to tables before starting to build
Fixes#22314
Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them.
This change integrates service level functionality into the CQL authentication and connection handling:
- Add scheduling_group_name to client_data to track service level assignments
- Extend SASL challenge interface to expose authenticated username
- Modify connection processing to support tenant switching:
- Add switch_tenant() method to handle scheduling group changes
- Add process_until_tenant_switch() to handle request processing boundaries
- Implement no_tenant() default executor
- Add execute_under_tenant_type for scheduling group management
- Update connection lifecycle to properly handle service level changes:
- Initialize connections with default scheduling group
- Support dynamic scheduling group updates when service levels change
- Ensure proper cleanup of scheduling group assignments
The changes enable proper scheduling group assignment and management based on
authenticated users' service levels, while maintaining backward compatibility
for connections without service level assignments.
Integrates audit functionality into CQL statement processing to enable tracking of database operations. Key changes:
- Add audit_info and statement_category to all CQL statements
- Implement audit categories for different statement types:
- DDL: Schema altering statements (CREATE/ALTER/DROP)
- DML: Data manipulation (INSERT/UPDATE/DELETE/TRUNCATE/USE)
- DCL: Access control (GRANT/REVOKE/CREATE ROLE)
- QUERY: SELECT statements
- ADMIN: Service level operations
- Add audit inspection points in query processing:
- Before statement execution
- After access checks
- After statement completion
- On execution failures
- Add password sanitization for role management statements
- Mask plaintext passwords in audit logs
- Handle both direct password parameters and options maps
- Preserve query structure while hiding sensitive data
- Modify prepared statement lifecycle to carry audit context
- Pass audit info during statement preparation
- Track audit info through statement execution
- Support batch statement auditing
This change enables comprehensive auditing of CQL operations while ensuring sensitive data is properly masked in audit logs.
Adds core integration of the audit subsystem into Scylla's main process flow. Changes include:
- Import audit subsystem header
- Initialize audit system during server startup using configuration and token metadata
- Start audit system after API server initialization with query processor and memory manager
- Add proper shutdown sequence for audit system using RAII pattern
- Add error handling for audit system initialization failures
The audit system is now properly integrated into Scylla's lifecycle, ensuring:
- Correct initialization order relative to other subsystems
- Proper resource cleanup during shutdown
- Graceful error handling for initialization failures
Adds detailed documentation covering the new audit subsystem:
- Add new audit.md design document explaining:
- Core concepts and design decisions
- CQL extensions for audit management
- Implementation details and trigger evaluation
- Prior art references from other databases
- Add user-facing documentation:
- New auditing.rst guide with configuration and usage details
- Integration with security documentation index
- Updates to cluster management procedures
- Updates to security checklist
The documentation covers all aspects of the audit system including:
- Configuration options and storage backends (syslog/table)
- Audit categories (DCL/DDL/AUTH/DML/QUERY/ADMIN)
- Permission model and security considerations
- Failure handling and logging
- Example configurations and output formats
This ensures users have complete guidance for setting up and using
the new audit capabilities.
This change introduces a new audit subsystem that allows tracking and logging of database operations for security and compliance purposes. Key features include:
- Configurable audit logging to either syslog or a dedicated system table (audit.audit_log)
- Selective auditing based on:
- Operation categories (QUERY, DML, DDL, DCL, AUTH, ADMIN)
- Specific keyspaces
- Specific tables
- New configuration options:
- audit: Controls audit destination (none/syslog/table)
- audit_categories: Comma-separated list of operation categories to audit
- audit_tables: Specific tables to audit
- audit_keyspaces: Specific keyspaces to audit
- audit_unix_socket_path: Path for syslog socket
- audit_syslog_write_buffer_size: Buffer size for syslog writes
The audit logs capture details including:
- Operation timestamp
- Node and client IP addresses
- Operation category and query
- Username
- Success/failure status
- Affected keyspace and table names
In this PR, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:
(a) The following things happened in order:
1. The view builder would be constructed.
2. Right after that, a deferred lambda would be created to stop the
view builder during shutdown.
3. group0_service would be started.
4. A deferred lambda stopping group0_service would be created right
after that.
5. The view builder would be started.
(b) Because the view builder depends on group0_client, it couldn't be
started before starting group0_service. On the other hand, other
services depend on the view builder, e.g. the stream manager. That
makes changing the order of initialization a difficult problem,
so we want to avoid doing that unless we're sure it's the right
choice.
(c) Since the view builder uses group0_client, there was a possibility
of running into a segmentation fault issue in the following
scenario:
1. A call to `view_builder::mark_view_build_success()` is issued.
2. We stop group0_service.
3. `view_builder::mark_view_build_success()` calls
`announce_with_raft()`, which leads to a use-after-free because
group0_service has already been destroyed.
This very scenario took place in scylladb/scylladb#20772.
Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534,
so we revert that patch. These changes are the second attempt
to the problem where we want to solve it in a safer manner.
The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aformentioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.
Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.
Backport: The patch I'm reverting made it to 6.2, so we want to backport this one there too.
Fixesscylladb/scylladb#20772Fixesscylladb/scylladb#21534Closesscylladb/scylladb#21909
* github.com:scylladb/scylladb:
test/topology_custom: Add test for Scylla with disabled view building
main, view: Pair view builder drain with its start
Revert "main,cql_test_env: start group0_service before view_builder"
To avoid potential hangs during the `remove_from_raft_config` operation, use a timed `modify_config` call.
This ensures the operation doesn't get stuck indefinitely.
for retrying.
There are several places in `raft_group0` where almost identical code is
used for retrying `modify_config` in case of `commit_status_unknown`
error. To avoid code duplication all these places were changed to
use a new wrapper `run_op_with_retry`.
to voter.
When a node is bootstrapped and joins a cluster as a non-voter, errors can occur while committing
a new Raft record, for instance, if the Raft leader changes during this time. These errors are not
critical and should not cause a node crash, as the action can be retried.
Fixesscylladb/scylladb#20814
Currently, our relocatable package doesn't contains p11-kit-trust.so
since it dynamically loaded, not showing on "ldd" results
(Relocatable packaging script finds dependent libraries by "ldd").
So we need to add it on create-relocatable-pacakge.py.
Also, we have two more problems:
1. p11 module load path is defined as "/usr/lib64/pkcs11", not
referencing to /opt/scylladb/libreloc
(and also RedHat variants uses different path than Debian variants)
2. ca-trust-source path is configured on build time (on Fedora),
it compatible with RedHat variants but not compatible with Debian
variants
To solve these problems, we need to override default p11-kit
configuration.
To do so, we need to add an configuration file to
/opt/scylladb/share/pkcs11/modules/p11-kit-trust.module.
Also, ofcause p11-kit doesn't reference /opt/scylladb by default, we
need to override load path by p11_kit_override_system_files().
On the configuration file, we can specify module load path by "modules: <path>",
and also we can specify ca-trust-source path by "x-init-reservied: paths=<path>".
Fixesscylladb/scylladb#13904Closesscylladb/scylladb#22302
error when starting up.
Test that a node is starting successfully if while joining a cluster and becoming a voter, it
receives an unknown commit status error.
Test for scylladb/scylladb#20814
Since when calling `modify_config` it's quite often we need to do
retries, to avoid code duplication, a function wrapper that allows
a function to be called with automatic retries in case of failures
was added.
- use "foo not in bar" instead of "not foo in bar"
- test/pylib: use foo instead of `'{}'.format(foo)`
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#22066
* github.com:scylladb/scylladb:
test/pylib: use `foo` instead of `'{}'.format(foo)`
test/pylib: use "foo not in bar" instead of "not foo in bar"
As part of #18750, we added a CQL statement CREATE ROLE WITH SALTED HASH that prevented hashing a password when creating a role, effectively leading to inserting a hash given by the user directly into the database. In #21350, we noticed that Cassandra had implemented a CQL statement of similar semantics but different syntax. We decided to rename Scylla's statement to be compatible with Cassandra. Unfortunately, we didn't notice one more difference between what we had in Scylla and what was part of Cassandra.
Scylla's statement was originally supposed to only be used when restoring the schema and the user needn't have to be aware of its existence at all: the database produced a sequence of CQL statements that the user saved to a file and when a need to restore the schema arose, they would execute the contents of the file. That's why that although we documented the feature, it was only done in the necessary places. Those that weren't related to the backup & restore procedure were deliberately skipped.
Cassandra, on the other hand, added the statement for a different purpose (for details, see the relevant issue) and it was supposed to be used by the user by design. The statement is also documented as such.
Since we want to preserve compatibility with Cassandra, we document the statement and its semantics in the user documentation, explicitly implying that it can be used by the user.
We also add a test verifying that logging in works correctly.
Fixesscylladb/scylladb#21691
Backport: not needed. The relevant code didn't make it to 6.2 or any previous version of OSS.
Closesscylladb/scylladb#21752
* github.com:scylladb/scylladb:
docs: Update documentation on CREATE ROLE WITH HASHED PASSWORD
test/boost: Add test for creating roles with hashed passwords
"sie" is the short for "system info encryption". it is a wrapper around
a `opts` map so we can get the individual option by providing a default
value via an `optional<>` return value. but "sie" could be difficult to
understand without more context. and it is used like a function -- we
get the individual option using its operator().
so, in order to improve the readability, in this change, we rename it
to "get_opt".
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
these misspellings are identified by codespell. they are either in
comment or logging messages. let's fix them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
please note, because quite a few source files relied on
`utils/to_string.hh` to pull in the specialization of
`fmt::formatter<std::optional<T>>`, after removing
`#include <fmt/std.h>` from `utils/to_string.hh`, we have to
include `fmt/std.h` directly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Previously, we hardwire the container to a previous frozen toolchain
image. but at the time of writing, the tree does not compile in the
specified toolchain image anymore, after the required building
environment is updated, and toolchain was updated accordingly.
in order to improve the maintability, let's reuse `read-toolchain.yaml`
job which reads `tools/toolchain/image`, so we don't have to hardwire
the container used for building the tree with the latest seastar. this
should address the build failure surfaced recently.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22287
The "--experimental" option was removed in commit f6cca741ea. Using this
deprecated option now causes Scylla to fail with the error:
```
error: the argument ('on') for option '--experimental-features' is invalid
```
So, in this change, let's update the docker entry point script to use
`--experimental-features` command line option instead. The related
document is updated accordingly.
Fixesscylladb/scylladb#22207
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22283
memtable_flush_period test sets the flush period to 200ms and checks
whether the data is flushed after 500ms.
When flush period is set, the timer is armed with the given value.
On expiration, memtables are flushed and then the timer is rearmed.
There is no certainty that during 500ms the flush finishes, though.
Check if after 500ms flush has started. Wait until there is an sstable.
Fixes: #21965.
Closesscylladb/scylladb#22162
In main.cc storage_service is started before and stopped after
repair_service. storage_service keeps a reference to sharded
repair_service and calls its methods, but nothing ensures that
repair_service's local instance would be alive for the whole
execution of the method.
Add a gate to repair_service and enter it in storage_service
before executing methods on local instances of repair_service.
Fixes: #21964.
Closesscylladb/scylladb#22145
In Scylla there are two options that control IO bandwidth limit -- the /storage_service/(compaction|stream)_throughput REST API endpoints. The endpoints are partially implemented and have no counterparts in the nodetool.
This set implements the missing bits and adds tests for new functionality.
Closesscylladb/scylladb#21877
* github.com:scylladb/scylladb:
nodetool: Implement [gs]etstreamthroughput commands
nodetool: Implement [gs]etcompationthroughput commands
test: Add validation of how IO-updating endpoints work
api: Implement /storage_service/(stream|compaction)_throughput endpoints
api: Disqualify const config reference
api: Implement /storage_service/stream_throughput endpoint
api: Move stream throughput set/get endpoints from storage service block
api: Move set_compaction_throughput_mb_per_sec to config block
util: Include fmt/ranges.h in config_file.hh
Guard the whole view builder startup routine by holding the semaphore
until it's done instead of releasing it early, so that it's not
intercepted by migration notifications.
The function write_view_build_status takes two lambda functions and
chooses which of them to run depending on the upgrade state. It might
run both of them.
The parameters ks_name and view_name should be passed by value instead
of by reference because they are moved inside each lambda function.
Otherwise, if both lambdas are run, the second call operates on invalid
values that were moved.
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.
Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.
Fixesscylladb/scylladb#20638
They will be useful for hosts and DCs selection for the repair
scheduler. It is not implemented yet. Adding it earlier, so we do not
need to change the system tabler later.
Closesscylladb/scylladb#21985
The sstable loader relied on the generation id to provide an efficient
hint about the shard that owns an sstable. But, this hint was rendered
ineffective with the introduction of UUID generation, as the shard id
was no longer embedded in the generation id. This also became suboptimal
with the introduction of tablets. Commit 0c77f77 addressed this issue by
reading the minimum from disk to determine sstable ownership but this
improvement was lost with commit 63f1969, which optimistically assumed
that hints would work most of the time, which isn't true.
This commit restores that change - shard id of a table is deduced by
reading minially from disk and then the sstable is fully loaded only if
it belongs to the local shard. This patch also adds a testcase to verify
that the sstable are loaded only in their respective shards.
Fixes#21015
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Updated `sstable_directory::load_sstable()` to directly accept
`data_dictionary::storage_options` instead of a function that returns
the same. This is required to ensure `process_descriptor()` loads the
sstable only once in the right shard.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
This series adds WCU support for the Alternator update item.
This motivation behind it, is to have a rough estimation of what a similar operation would have taken from WCU perspective if used with DynamoDB.
The calculation is done while minimal overhead is the prime objective, the results are values that is less or equal to what it would have been in DynamoDB
** New feature, no need to backport. **
Closesscylladb/scylladb#21999
* github.com:scylladb/scylladb:
alternator/test_returnconsumedcapacity.py: update item
alternator/executor.cc: Add WCU for update_item
This update addresses an issue in the mutation diff calculation
algorithm used during read repair. Previously, the algorithm used
`token` as the hashmap key. Since `token` is calculated basing on the
Murmur3 hash function, it could generate duplicate values for different
partition keys, causing corruption in the affected rows' values.
Fixesscylladb/scylladb#19101
Since the issue affects all the relevant scylla versions, backport to: 6.1, 6.2
Closesscylladb/scylladb#21996
* github.com:scylladb/scylladb:
storage_proxy/read_repair: Remove redundant 'schema' parameter from `data_read_resolver::resolve` function.
storage_proxy/read_repair: Use `partition_key` instead of `token` key for mutation diff calculation hashmap.
test: Add test case for checking read repair diff calculation when having conflicting keys.
Since we manage ip to id mapping directly in gossiper now we need to
load the mapping on boot. We already do it anyway, but only due to a bug
which checks raft topology mode config before it is set, so the code
thinks that it is in the gossiper mode and loads peers table into the
gossiper and token metadata. Fix the bug and load peers into the gossiper
only since token metadata is managed by raft.
The series also removes address map related test that no longer checks
anything and replace it with unit test.
It also adds the dc/rack check to "join node" rpc. The check is done
during shadow round now, but for it to work it requires dc/rack to be
propagated through the gossiper and we want to eventually drop it.
Ref: scylladb/scylladb#21777
* 'load-peers' of https://github.com/gleb-cloudius/scylla:
topology coordinator: reject replace request if topology does not match
gossiper: fix the logic of shadow_round parameter
storage_service: do not add endpoint to the gossiper during topology loading.
storage_service: load peers into gossiper on boot in raft topology mode
storage_service: set raft topology change mode before using it in join_cluster
locator: drop inet_address usage to figure out per dc/rack replication
test: drop test_old_ip_notification_repro.py
test: address_map: check generation handling during entry addition
It now parses only table names from its "cf" argument. Parsing
table_infos has two benefits -- it makes it possible to hide
parse_tables() thus keeping less validation code around, and the
subsequent db.find_column_family() call can avoid re-lookup of table
uuid by its ks:table pair.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Several places call parse_fully_qualified_cf_name() and get_uuid()
helpers one after another. Previous patch introduced the
parse_table_info() one that wraps both.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method gets "fully qualified" table name, which is 'ks:cf' string
and returns back the resolved table_id value. Some callers will benefit
from knowing the parsed 'cf' part of it (see next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This argument is needed to find table by ks:cf prair. The "table" part
is taken from the vector of table_info-s, but table_info-s have table_id
value onboard, and the table can be found by this id. So keyspace is not
needed any longer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All callers of it already have one. Next patch will make even more use
of those passed table_info-s.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's no longer used outside of api/storage_service.cc. It's not yet
possible to remove it completely, but it's better not to encourage
others to use it outside of its current .cc file.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Callers of this method provide vectors of two kinds:
- explicitly single-entry one from endpoints that work on single table
- vector returned by parse_table_infos()
The latter helper, if it gets empty list of tables from user, populates
its return value with all tables from the given keyspace.
The removed check became obsolete after recent changes. Prior to those,
the 2nd case provided vector from another helper called parse_tables(),
which could return empty result.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set_tables_...() helper called here accept vector by value, so the
existing code copies it. It's better to move, all the more so next
changes will make this place pass vectors with more data onboard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This handler doesn't check if the requested table exists. If it doesn't
it will throw later anyway, but most of other endpoints that work with
tables check table early. This early check allows throwing bad-param
exception on missing table, not internal-server-error one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper returns uuid, but also "Validates" the table exists by
calling db.find_uuid() and throwing bad_param exception on error.
This change will allow making for_table_on_all_shards() smaller a bit
later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The one is the same as parse_tables(), but returns back name:id pairs.
This change will allow making for_table_on_all_shards() smaller a bit
later, as well as removing the parse_tables() code eventually.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Developers using asyncio.gather() often assume that it waits for all futures (awaitables) givens.
But this isn't true when the return_exceptions parameter is False, which is the default.
In that case, as soon as one future completes with an exception, the gather() call will return this exception immediately, and some of the finished tasks may continue to run in the background.
This is bad for applications that use gather() to ensure that a list of background tasks has all completed.
So such applications must use asyncio.gather() with return_exceptions=True, to wait for all given futures to complete either successfully or unsuccessfully.
Closesscylladb/scylladb#22252
Said fields in statistics are of type
`disk_array<uint32_t, disk_string<uint16_t>>` and currently are handled
as array of regular strings. However these fields store exploded
clustering keys, so the elements store binary data and converting to
string can yield invalid UTF-8 characters that certain JSON parsers (jq,
or python's json) can choke on. Fix this by treating them as binary and
using `to_hex()` to convert them to string. This requires some massaging
of the json_dumper: passing field offset to all visit() methods and
using a caller-provided disk-string to sstring converter to convert disk
strings to sstring, so in the case of statistics, these fields can be
intercepted and properly handled.
While at it, the type of these fields is also fixed in the
documentation.
Before:
"min_column_names": [
"��Z���\u0011�\u0012ŷ4^��<",
"�2y\u0000�}\u007f"
],
"max_column_names": [
"��Z���\u0011�\u0012ŷ4^��<",
"}��B\u0019l%^"
],
After:
"min_column_names": [
"9dd55a92bc8811ef12c5b7345eadf73c",
"80327900e2827d7f"
],
"max_column_names": [
"9dd55a92bc8811ef12c5b7345eadf73c",
"7df79242196c255e"
],
Fixes: #22078Closesscylladb/scylladb#22225
scylla-sstable tries to read scylla.yaml via the following sequence:
1) Use user-provided location is provided (--scylla-yaml-file parameter)
2) Use the environment variables SCYLLA_HOME and/or SCYLLA_CONF if set
3) Use the default location ./conf/scylla.yaml
Step 3 is fine on dev machines, where the binaries are usually invoked
from scylla.git, which does have conf/scylla.yaml, but it doesn't work
on production machines, where the default location for scylla.yaml is
/etc/scylla/scylla.yaml. To reduce friction when used on production
machines, add another fallback in case (3) fails, which tries to read
scylla.yaml from /etc/scylla/scylla.yaml location.
Fixes: scylladb/scylladb#22202Closesscylladb/scylladb#22241
When destroying a test cluster, ScyllaCluster.stop() calls ScyllaServer.stop()
for each running server. Previously, non-zero exit status codes from scylla
servers were silently ignored during test teardown.
This change modifies the logging behavior to print the exit status code when
a scylla server exits with a non-zero status. This helps developers quickly
identify potential issues or unexpected terminations during test runs.
Differences in handling:
- Before: Non-zero exit codes were not logged
- After: Non-zero exit codes are printed, providing visibility into
server termination errors
This improvement aids in diagnosing intermittent test failures or
unexpected server shutdowns during test execution.
Refs #21742
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21934
Previously, we created a vector<utils_json::histogram> and returned it
by copying into a future. Since histogram is a JSON representation of
ihistogram, it can be heavyweight, making the vector copy overhead
significant.
Now we move the vector into the returned future instead of copying it,
eliminating the deep copy overhead. The APIs backed by this function
are marked deprecated, so this performance improvement is not that
important.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22004
In 0b0e661a85, we brought abseil back as a submodule, and we
added absl::headers as an interface library for importing
abseil headers' include directory. And:
```console
$ patchelf --print-rpath build/RelWithDebInfo/scylla
/home/kefu/dev/scylla/idl/absl::headers
```
In this change, we remove `absl::headers` from
`target_link_directories()` as it's an interface library that only
provides header files, not linkable libraries. This fixes the incorrect
inclusion of absl::headers in the rpath of the scylla executable.
Additionally, remove abseil library dependencies from the idl target
since none of the idl source files directly include abseil headers.
After this change,
```console
$ patchelf --print-rpath build/RelWithDebInfo/scylla
```
the output of `pathelf` is now empty.
Fixes#22265
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22266
We restore a snapshot of table by streaming the sstables of
the given snapshot of the table using
`sstable_streamer::stream_sstable_mutations()` in batches. This function
reads mutations from a set of sstables, and streams them to the target
nodes. Due to the limit of this function, we are not able to track the
progress in bytes.
Previously, progress tracking used individual sstables as units, which caused
inaccuracies with tablet-distributed tables, where:
- An sstable spanning multiple tablets could be counted multiple times
- Progress reporting could become misleading (e.g., showing "40" progress
for a table with 10 sstables)
This change introduces a more robust progress tracking method:
- Use "batch" as the unit of progress instead of individual sstables.
Each batch represents a tablet when restoring a table snapshot if
the tablet being restored is distributed with tablets. When it comes
to tables distributed with vnode, each batch represents an sstable.
- Stream sstables for each tablet separately, handling both partially and
fully contained sstables
- Calculate progress based on the total number of sstables being streamed
- Skip tablet IDs with no owned tokens
For vnode-distributed tables, the number of "batches" directly corresponds
to the number of sstables, ensuring:
- Consistent progress reporting across different table distribution models
- Simplified implementation
- Accurate representation of restore progress
The new approach provides a more reliable and uniform method of tracking
restoration progress across different table distribution strategies.
Also, Corrected the use of `_sstables.size()` in
`sstable_streamer::stream_sstables()`. It addressed a review comment
from Pavel that was inadvertently overlooked during previous rebasing
the commit of 5ab4932f34.
Fixesscylladb/scylladb#21816
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21841
Before this commit, there doesn't seem to have been a test verifying that
starting and shutting down Scylla behave correctly when the configuration
option `view_building` is set to false. In these changes, we add one.
In these changes, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:
(a) The following things happened in order:
1. The view builder would be constructed.
2. Right after that, a deferred lambda would be created to stop the
view builder during shutdown.
3. group0_service would be started.
4. A deferred lambda stopping group0_service would be created right
after that.
5. The view builder would be started.
(b) Because the view builder depends on group0_client, it couldn't be
started before starting group0_service. On the other hand, other
services depend on the view builder, e.g. the stream manager. That
makes changing the order of initialization a difficult problem,
so we want to avoid doing that unless we're sure it's the right
choice.
(c) Since the view builder uses group0_client, there was a possibility
of running into a segmentation fault issue in the following
scenario:
1. A call to `view_builder::mark_view_build_success()` is issued.
2. We stop group0_service.
3. `view_builder::mark_view_build_success()` calls
`announce_with_raft()`, which leads to a use-after-free because
group0_service has already been destroyed.
This very scenario took place in scylladb/scylladb#20772.
Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534.
We reverted that change in the previous commit. These changes are the
second attempt to the problem where we want to solve it in a safer manner.
The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aformentioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.
Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.
Fixesscylladb/scylladb#20772
The patch solved a problem related to an initialization order
(scylladb/scylladb#20772), but we ran into another one: scylladb/scylladb#21534.
After moving the initialization of group0_service, it ended up being destroyed
AFTER the CDC generation service would. Since CDC generations are accessed
in `storage_service::topology_state_load()`:
```
for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
co_await _cdc_gens.local().handle_cdc_generation(gen_id);
```
we started getting the following failure:
```
Service &seastar::sharded<cdc::generation_service>::local() [Service = cdc::generation_service]: Assertion `local_is_initialized()' failed.
```
We're reverting the patch to go back to a more stable version of Scylla
and in the following commit, we'll solve the original issue in a more
systematic way.
This reverts commit 7bad8378c7.
Fixes https://github.com/scylladb/scylla-enterprise/issues/5016#issuecomment-2558464631
EAR - encryption at rest. Allows on-disk file encryption of sstables and commitlog data.
Introduces OpenSSL based file level encrypted storage, managed via a set of providers
ranging from local files to cloud KMS providers.
For a more comprehensive explanation, see the included docs (or if possible, original
source tree).
Manual bulk merge of EAR feature from enterprise repo to main scylla repo.
Breaks some features apart, but main EAR is still a humongous commit, because to separate this
I would have to mess with code incrementally, adding time and risk.
This PR includes the local file gen tool, tests and also p11 validation.
Note: CI will not execute the full tests unless master CI is set to provide the same environment
as the enterprise one. Not sure about the status of this ATM.
Note: Includes code to compile against cryptsoft kmipc SDK, but not the SDK. If you happen to
check out this tree in the scylla folder and configure, it will be linked against and KMIP functionality
will be enabled, otherwise not.
Closesscylladb/scylladb#22233
* github.com:scylladb/scylladb:
docs: Add EAR docs
main/build: Add p11-kit and initialize
tools: Add local-file-key-generator tool
tests: Add EAR tests
tmpdir: shorten test tempdir path
EAR: port the ear feature from enterprise
cql_test_env: Add optional query timeout
schema/migration_manager: Add schema validate
sstables: add get_shared_components accessor
config/config_file: Add exports and definitions of config_type_for<>
the changes porting enterprise features to oss brought some
used include to the tree. so let's remove them. these unused
includes were identified by clang-include-cleaner.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22246
Instantiated only on shard 0.
Currently, only subscribe from unit test
Manual unit test using loop mount was added.
Note that the test requires sudo access
and root access to /dev/loop, so it cannot
run in rootless podman instance, and it'd
fail with Permission denied.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#21523
This PR extends authentication with 2 mechanisms:
- a new role_manager subclass, which allows managing users via
LDAP server,
- a new authenticator, which delegates plaintext authentication
to a running saslauthd daemon.
The features have been ported from the enterprise repository
with their test.py tests and the documentation as part of
changing license to source available.
Fixes: scylladb/scylla-enterprise#5000Fixes: scylladb/scylla-enterprise#5001Closesscylladb/scylladb#22030
Replace boost with a standard facility; this reduces dependencies as lexical_cast depends on boost ranges.
Since std::from_chars() is chatty, we introduce utils::from_chars_exactly() to trade some flexibility for conciseness.
Small build time improvement, no backport needed.
Closesscylladb/scylladb#22164
* github.com:scylladb/scylladb:
error_injection: replace boost::lexical_cast with std::from_chars
utils: introduce from_chars_exactly()
Reintroduce `get_shards_for_this_sstable()` that was removed in commit
ad375fbb. This will be used in the following patch to ensure that an
sstable is loaded only once.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The test is skipped in debug mode, because the preparation of revoke
takes too long and wait request, which needs to be started before
the preparation, hits timeout.
Pass task_info down to storage_group::split.
In the following patches, it will be used to set the parent
of offstrategy_compaction_task_executor and split_compaction_task_executor
running as a part of the split. The task_info param will contain task
info of a split virtual task.
Currently, streaming_task_impl is the only existing child of any
virtual task. It overrides the is_internal definition so that it
is non-internal even though it has a parent.
This should apply to all children of all virtual tasks. Modify
task_manager::task::impl::is_internal so that children of virtual
tasks aren't internal by default.
Initialize shard in task_info constructor. All current usages do
not care about the shard of an empty task_info.
In the following patches we may need that for setting info about
virtual task parent.
Extend tablet_virtual_task::wait to support resize tasks.
To decide what is a state of a finished resize virtual task (done
or failed), the tablet count is checked. The task state is set to done,
if the tablet count before resize is different than after.
Extend tablet_virtual_task::contains to check resize operations.
Methods that do not support resize tasks return immediately if they
are handling split or merge task.
Add resize_task_info static column to system.tablets. Set or delete
resize_task_info value when the resize_decision is changed.
Reflect the column content in tablet_map.
When starting the view builder, we find all existing views in
`calculate_shard_build_step` and then register a listener for new views.
Between these steps we may yield and create a new view, then we miss
initializing the view build step for the new view, and we won't start
building it.
To fix this we first register the listener and then read existing views,
so a view can't be missed.
Fixesscylladb/scylladb#20338Closesscylladb/scylladb#22184
Bulk transfer of EAR functionality. Includes all providers etc.
Could maybe break up into smaller blocks, but once it gets down to
the core of it, would require messing with code instead of just moving.
So this is it.
Note: KMIP support is disabled unless you happen to have the kmipc
SDK in your scylla dir.
Adds optional encryption of sstables and commitlog, using block
level file encryption. Provides key sourcing from various sources,
such as local files or popular KMS systems.
Replace boost with a standard facility; this reduces dependencies
as lexical_cast depends on boost ranges.
As a side effect the exception error message is improved.
This is a replacement for boost::lexical_cast (but without its
long dependency chain). It wraps std::from_chars(), providing a less
flexible but also more concise interface.
The dict publication routine might throw raft::request_aborted when the node is
aborted. This doesn't deserve an ERROR log. Let's demote the log printed in this
case from ERROR to DEBUG.
Fixesscylladb/scylladb#22081Closesscylladb/scylladb#22211
Add the test file name to `ScyllaClusterManager` log file names alongside the test function name.
This avoids race conditions when tests with the same function names are executed simultaneously.
Fixesscylladb/scylladb#21807
Backport: not needed since this is a fix in the testing scripts.
Closesscylladb/scylladb#22192
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22199
Move the Scylla executable existence check from PythonTestSuite's constructor
to test execution time. This allows running unit tests that don't depend on
the scylla executable without building it first.
Previously, PythonTestSuite's constructor would fail if the Scylla executable
was missing, preventing even unrelated unit tests from running. Now, only
tests that actually require Scylla will fail if the executable is
missing.
Fixesscylladb/scylladb#22168
Refs scylladb/scylladb#19486
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22224
replica writes are delayed according to the view update backlog in order
to apply backpressure and reduce the rate of incoming base writes when
the backlog is large, allowing slow replicas to catch up.
previously the backlog calculation considered only the pending targets,
excluding targets that replied successfuly, probably due to confusion in
the code. instead, we want to consider the backlog of all the targets
participating in the write.
Fixesscylladb/scylladb#21672Closesscylladb/scylladb#21935
this unused include was identifier by clang-include-cleaner. after
auditing task_manager_module.hh, the report has been confirmed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22200
The definition of the template is in a source translation unit, but there
are also uses outside the translation unit. Without lto/pgo it worked due
to the definition in the translation unit, but with lto/pgo we can presume
the definition was inlined, so callers outside the translation unit did not
have anything to link with.
Fix by explicitly instantiating the template function.
Closesscylladb/scylladb#22136
* seastar 3133ecdd...a9bef537 (24):
> file: add file_system_space
> future: avoid inheriting from future payload type
> treewide: include fmt/ostream.h for using fmt::print()
> build: remove messages used for debugging
> demos: Rename websocket demo to websocket_server demo
> demos: Add a way to set port from cmd line in websocket demo
> tls: Add optional builder + future-wait to cert reload callback + expose rebuild
> rwlock: add try_hold_{read,write}_lock methods
> json: add moving push to json_list
> github: add a step to build "check-include-style"
> build: add a target for checking include style
> scheduling_group: use map for key configs instead of vector
> scheduling_group: fix indentation
> scheduling_group: fix race between scheduling group and key creation
> http: Make request writing functions public
> http: Expose connection_factory implementations
> metrics: Use separate type for shared metadata
> file: unexpected throw from inside noexcept
> metrics: Internalize metric label sets
> thread: optimize maybe_yield
> reactor: fix crash in pending registration task after poller dtor
> net: Fix ipv6 socket_address comparision
> reactor, linux-aio: factor out get_smp_count() lambda
> reactor, linux-aio: restore "available_aio" meaning after "reserve_iocbs"
Fixed usage of seastar metric label sets due to:
scylladb/seastar@733420d57 Merge 'metrics: Internalize metric label sets' from Stephan Dollberg
Closesscylladb/scylladb#22076
remove the "ScyllaDB Enterprise" labels in document. because
there is no need to differentiate ScyllaDB Enterprise from its OSS
variant, let's stop adding the "ScyllaDB Enterprise" labels to
enterprise-only features. this helps to reduce the confusion.
as we are still in the process of porting the enterprise features
to this repo, this change does not fixscylladb/scylladb#22175.
we will review the document again when completing the migration.
we also take this opportunity to stop referencing "Enterprise" in
the changed paragraph.
Refs scylladb/scylladb#22175
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22177
in 047ce136, we cherry-picked the change adding
garbage-collection-ics.rst to the document. but it was still
referencing the git sha1 and version number in enterprise.
this change updates kb/garbage-collection-ics.rst, so that it
* references the git commit sha1 in this repo
* do not reference the version introducing this feature, as
per Anna Stuchlik
> As a rule, we should avoid documenting when something was
> introduced or set as a default because our documentation
> was versioned. Per-version information should be listed in
> the release notes.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22195
Usually, the smaller the messsage, the higher the CPU cost per each network byte
saved by compression, so it often makes sense to reserve heavier compression
for bigger messages (where it can make the biggest impact for a given CPU budget)
and use ligher compression for smaller messages.
There is a knob -- internode_compression_zstd_min_message_size -- which
excludes RPC messages below certain size from being compressed with zstd.
We arbitrarily set its default to 0 bytes before.
Now we want to arbitrarily set it to 1024 bytes.
This is based purely on intuition and isn't backed by any solid data.
Fixesscylladb/scylla-enterprise#4731Closesscylladb/scylla-enterprise#4990Closesscylladb/scylladb#22204
We still have a number of issues to be solved for views with tablets.
Until they are fixed, we should prevent users from creating them,
and use the vnode-based views instead.
This patch prepares the feature for enabling views with tablets. The
feature is disabled by default, but currently it has no effect.
After all tests are adjusted to use the feature, we should depend
on the feature for deciding whether we can create materialized views
in tablet-enabled keyspaces.
The unit tests are adjusted to enable this feature explicitly, and it's
also added to the scylla sstable tool config - this tool treats all
tables as if they were tablet-based (surprisingly, with SimpleStrategy),
so for it to work on views, the new feature must be enabled.
Refs scylladb/scylladb#21832Closesscylladb/scylladb#21833
The raft voters api implementation only allowed to make a node to be
a non-voter, but for the "limited voters" feature we need to
also have the option to make the node a voter (from within the topology
coordinator).
Modifying the api to allow both adding and removing voters.
This in particular tries to simplify the API by not having to add
another set of new functions to make a voter, but having a single setter
that allows to modify the node configuration to either become a voter or
a non-voter.
Fixes: scylladb/scylladb#21914
Refs: scylladb/scylladb#18793Closesscylladb/scylladb#21899
In case of error, repair will be moved into the end_repair stage. We
should not remove repair_task_info in this case because the repair task
requested by the user is not finished yet.
To fix, we should remove repair_task_info at the end of repair stage.
Tests are added to ensure failed repair is not reported as finished.
Closesscylladb/scylladb#21973
Avoid using temporary names and instead treat the final image tag
as a temporary.
The new procedure is more or less
remote-final := local-x86_64
local-aarch64 += remote-final
remote-final := local-aarch64 (which now contains the x86_64 image too)
Closesscylladb/scylladb#21981
So that both mutation and file streaming will have the same log for
tablet streaming which simplifies the dtest checking.
Closesscylladb/scylladb#22176
Test the limited voters feature by creating a cluster with 3 DCs, one of
them disproportionately larger than the others. The raft majority should
not be lost in case the large DC goes down.
Fixes: scylladb/scylla#21915
Refs: scylladb/scylla#18793Closesscylladb/scylladb#21901
mark the config parameter --commitlog-use-hard-size-limit as deprecated so the
default 'true' is always used, making the hard limit mandatory.
Fixesscylladb/scylladb#16471Closesscylladb/scylladb#21804
node_ops_virtual_task does not filter the entries of system.topology_request
and so it creates statuses of operations that aren't node ops.
Filter the entries used by node_ops_virtual_task. With this change, the status
of a bootstrap of the first node will not be visible.
Fixes: https://github.com/scylladb/scylladb/issues/22008.
Needs backport to 6.2 that introduced node_ops_virtual_task
Closesscylladb/scylladb#22009
* github.com:scylladb/scylladb:
test: truncate the table before node ops task checks
node_ops: rename a method that get node ops entries
node_ops: filter topology_requests entries
now that we are allowed to use C++23. we now have the luxury of using
`std::views::reverse`.
- replace `boost::adaptors::transformed` with `std::views::transform`
- remove unused `#include <boost/range/adaptor/reversed.hpp>`
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The block size of 1k is significantly increasing metadata overhead with xfs since it reserves space upfront for btree expansion. With CRC disabled, this reservation doesn't happen. Smaller btree blocks reduce the fanout factor, increasing btree height and the reservation size. So block size implies a trade-off between write amplification and metadata size. Bigger blocks, smaller metadata, more write ampl. Smaller blocks, more metadata, and less write ampl.
Let's disable both `rmapbt` and `relink` since we replicate data, and we can afford to rebuild a replica on local corruption.
Fixes: https://github.com/scylladb/scylladb/issues/22028Closesscylladb/scylladb#22072
Main problem:
If we're draining the last node in a DC, we won't have a chance to
evaluate candidates and notice that constraints cannot be satisfied (N
< RF). Draining will succeed and node will be removed with replicas
still present on that node. This will cause later draining in the same
DC to fail when we will have 2 replicas which need relocaiton for a
given tablet.
The expected behvior is for draining to fail, because we cannot keep
the RF in the DC. This is consistent, for example, with what happens
when removing a node in a 2-node cluster with RF=2.
Fixes#21826
Secondary problem:
We allowed tablet_draining transition to be exited with undrained nodes, leaving replicas on nodes in the "left" state.
Third problem:
We removed DOWN nodes from the candidate node set, even when draining. This is not safe because it may lead to overload. This also makes the "main problem" more likely by extending it to the scenario when the DC is DOWN.
The overload part in not a problem in practice currently, since migrations will block on global topology barrier if there are DOWN nodes.
Closesscylladb/scylladb#21928
* github.com:scylladb/scylladb:
tablets: load_balancer: Fail when draining with no candidate nodes
tablets: load_balancer: Ignore skip_list when draining
tablets: topology_coordinator: Keep tablet_draining transition if nodes are not drained
This change is related to the unification of enterprise and open-source repositories.
The Sphinx configuration is updated to build documentation either for `docs.scylladb.com/manual` or `opensource.docs.scylladb.com`, depending on the flag passed to Sphinx.
By default, it will build docs for `docs.scylladb.com/manual`. If the `opensource` flag is passed, it will build docs for `opensource.docs.scylladb.com`, with a different set of versions.
This change will prepare the configuration to publish to `docs.scylladb.com/manual` while allowing the option to keep publishing and editing docs with a different multiversion configuration.
Note that this change will continue publishing docs to `opensource.docs.scylladb.com` for now since the `opensource` flag is being passed in the `gh-pages.yml` branch.
chore: remove comment
chore: update project name
Closesscylladb/scylladb#22089
Replace usages of `boost::algorithm::join()` with `fmt::join()` to improve
performance and reduce dependency on Boost. `fmt::join()` allows direct
formatting of ranges and tuples with custom separators without creating
intermediate strings.
When formatting comma-separated values into another string, fmt::join()
avoids the overhead of temporary string creation that
`boost::algorithm::join()` requires. This change also helps streamline
our dependencies by leveraging the existing fmt library instead of
Boost.Algorithm.
To avoid the ambiguity, some caller sites were updated to call
`seastar::format()` explicitly.
See also
- boost::algorithm::join():
https://www.boost.org/doc/libs/1_87_0/doc/html/string_algo/reference.html#doxygen.join_8hpp
- fmt::join():
https://fmt.dev/11.0/api/#ranges-api
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22082
Currently task_manager_module::is_aborted checks the tasks local
to caller's shard on a given shard.
Fix the method to check the task map local to the given shard.
Fixes: #22156.
Closesscylladb/scylladb#22161
in order to prevent future inclusion of unused headers, let's include
- mutation_writer
- node_ops
- redis
- replica
subdirectories to CLEANER_DIR, so that this workflow can identify the
regressions in future.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22050
The recent pull request https://github.com/scylladb/scylladb/pull/22031 introduced some regressions into the test/alternator framework. For a long time now, tests can create their own CQL roles for testing role-based features. But the new service levels test changed the "run" script and test.py's "suite.yaml" to create a new role and service level just for one test. This is not only ugly (the test code is now split to two places) and unnecessary, this setup also means that you can't run this test against an already-running copy of Scylla which wasn't prepared with the "right" role and service level. Even worse - the code that was added test/alternator/run was plain wrong - it used an outdated keyspace name (the code in suite.yaml was fine).
So in this patch I remove that extra run and suite.yaml code, and replace it by code inside the service level test to create the role and service level that it wants to test rather than assume it already exists.
While at it, I also removed a lot of duplicate and unnecessary code from this test.
After this patch, test/alternator/run returns to work correctly, after #22031 broke it.
This patch fixes a recent testing-framework regression, so doesn't need to be backported (unless that regression is backported).
Fixes#22047.
Closesscylladb/scylladb#22172
* github.com:scylladb/scylladb:
test/alternator: fix mistakes introduced with test_service_levels.py
test/alternator: move "cql" fixture to test/alternator/conftest.py
since Python 3.13, passing count to `re.sub()` as positional argument
has been deprecated. and when runnint `test.py` with Python 3.13, we
have following warning:
```
/home/kefu/dev/scylladb/./test.py:1540: DeprecationWarning: 'count' is passed as positional argument
args.modes = re.sub(r'.* List configured modes\n(.*)\n', r'\1',
```
see also https://github.com/python/cpython/issues/56166
in order to silence this distracting warning, let's pass
`count` using kwarg.
this change was created in the same spirit of c3be4a36af.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22085
Previously, `api/service_levels.hh` includes `api/api.hh` for
accessing symbols like `api/http_context`. but these symbols are
already available in a "smaller" header -- `api/api_init.hh`. so,
in order to improve the build efficiency, let's include smaller
headers in favor of "larger" ones.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22178
This patch adds tests for return consumed capacity for update_item.
The tests cover: a simple update for a small object, a missing item, an
update with a very large attribute (where the attribute itself is more
than 1KB), and an update of a big item that uses read-before-write.
test.py: Only check existence of Scylla executable
Previously, we had inconsistent behavior around missing executables:
- 561e88f0 added early failure if any executable was missing
- 8b7a5ca8 added a partial skip for combined_test, but didn't properly
handle build paths and artifacts
This change:
1. Moves executable existence check to PythonTestSuite class
2. Only adds combined_test suite when the executable exists
3. Eliminates redundant os.access() checks
4. Corrects the path to combined_test when checking for its existence
This allows running tests with a partial build while properly handling
missing executables, particularly for the combined_test suite.
Fixesscylladb/scylladb#22086
---
no need to backport, because the offending commit (8b7a5ca88d) is not included by any LTS branches yet.
Closesscylladb/scylladb#22163
* github.com:scylladb/scylladb:
test.py: Fix path checking for combined_test executable
test.py: Throw only if scylla executable is not found
This patch undoes multiple mistakes done when introducing the test
for service levels in pull request #22031:
1. The PR introduced in test/alternator/run and test/alternator/suite.yaml
a permanent role and service level that the service-level test is
supposed to use. This was a mistake - the test can create the service
level for its own use, using CQL, it does not need to assume such a
service level already exists.
It's important to fix this to allow the service level test to run
against an installation of Scylla not set up by our own scripts.
Moreover, while the code in suite.yaml was correct, the code in
"run" was incorrect (used an outdated keyspace name). This patch
removes that incorrect code.
2. The PR introduced a duplicate "cql" fixture, copied verbatim
from test_cql_rbac.py (including a comment that was correct only
in the latter file :-)). Let's de-duplicate it, using the fixture
that I moved to conftest.py in the previous patch.
3. The PR used temporary_grant(). This needelessly complicated the test
and added even more duplicate code, and this patch removes all that
stuff. This test is about service levels, not RBAC and "grant".
This test should just use a superuser role that has the permissions
to do everything, and don't need to be granted specific permissions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Most Alternator test use only the DynamoDB API, not CQL. Tests in
test_cql_rbac.py did need CQL to set up roles and RBAC, so this file
introduced a "cql" fixture to make CQL requests.
A recently-introduced test/alternator/test_service_levels.py also
needs access to CQL - it currently uses it for misguided reasons but
the next patch will need it for creating a role and a service level.
So instead of duplicating this fixture, let's move this fixture into
test/alternator/conftest.py that all Alternator tests can share.
The next patch will clean up this duplication in test_service_levels.py
and the other mistakes it introduced.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
ICS is a compaction strategy that inherits size tiered properties --
therefore it's write optimized too -- but fixes its space overhead of
100% due to input files being only released on completion. That's
achieved with the concept of sstable run (similar in concept to LCS
levels) which breaks a large sstable into fixed-size chunks (1G by
default), known as run fragments. ICS picks similar-sized runs
for compaction, and fragments of those runs can be released
incrementally as they're compacted, reducing the space overhead
to about (number_of_input_runs * 1G). This allows user to increase
storage density of nodes (from 50% to ~80%), reducing the cost of
ownership.
NOTE: test_system_schema_version_is_stable adjusted to account for batchlog
using IncrementalCompactionStrategy
contains:
compaction/: added incremental_compaction_strategy.cc (.hh), incremental_backlog_tracker.cc (.hh)
compaction/CMakeLists.txt: include ICS cc files
configure.py: changes for ICS files, includes test
db/legacy_schema_migrator.cc / db/schema_tables.cc: fallback to ICS when strategy is not supported
db/system_keyspace: pick ICS for some system tables
schema/schema.hh: ICS becomes default
test/boost: Add incremental_compaction_test.cc
test/boost/sstable_compaction_test.cc: ICS related changes
test/cqlpy/test_compaction_strategy_validation.py: ICS related changes
docs/architecture/compaction/compaction-strategies.rst: changes to ICS section
docs/cql/compaction.rst: changes to ICS section
docs/cql/ddl.rst: adds reference to ICS options
docs/getting-started/system-requirements.rst: updates sentence mentioning ICS
docs/kb/compaction.rst: changes to ICS section
docs/kb/garbage-collection-ics.rst: add file
docs/kb/index.rst: add reference to <garbage-collection-ics>
docs/operating-scylla/procedures/tips/production-readiness.rst: add ICS section
some relevant commits throughout the ICS history:
commit 434b97699b39c570d0d849d372bf64f418e5c692
Merge: 105586f747 30250749b8
Author: Paweł Dziepak <pdziepak@scylladb.com>
Date: Tue Mar 12 12:14:23 2019 +0000
Merge "Introduce Incremental Compaction Strategy (ICS)" from Raphael
"
Introduce new compaction strategy which is essentially like size tiered
but will work with the existing incremental compaction. Thus incremental
compaction strategy.
It works like size tiered, but each element composing a tier is a sstable
run, meaning that the compaction strategy will look for N similar-sized
sstable runs to compact, not just individual sstables.
Parameters:
* "sstable_size_in_mb": defines the maximum sstable (fragment) size
composing
a sstable run, which impacts directly the disk space requirement which is
improved with incremental compaction.
The lower the value the lower the space requirement for compaction because
fragments involved will be released more frequently.
* all others available in size tiered compaction strategy
HOWTO
=====
To change an existing table to use it, do:
ALTER TABLE mykeyspace.mytable WITH compaction =
{'class' : 'IncrementalCompactionStrategy'};
Set fragment size:
ALTER TABLE mykeyspace.mytable WITH compaction =
{'class' : 'IncrementalCompactionStrategy', 'sstable_size_in_mb' : 1000 }
"
commit 94ef3cd29a196bedbbeb8707e20fe78a197f30a1
Merge: dca89ce7a5 e08ef3e1a3
Author: Avi Kivity <avi@scylladb.com>
Date: Tue Sep 8 11:31:52 2020 +0300
Merge "Add feature to limit space amplification in Incremental Compaction" from Raphael
"
A new option, space_amplification_goal (SAG), is being added to ICS. This option
will allow ICS user to set a goal on the space amplification (SA). It's not
supposed to be an upper bound on the space amplification, but rather, a goal.
This new option will be disabled by default as it doesn't benefit write-only
(no overwrites) workloads and could hurt severely the write performance.
The strategy is free to delay triggering this new behavior, in order to
increase overall compaction efficiency.
The graph below shows how this feature works in practice for different values
of space_amplification_goal:
https://user-images.githubusercontent.com/1409139/89347544-60b7b980-d681-11ea-87ab-e2fdc3ecb9f0.png
When strategy finds space amplification crossed space_amplification_goal, it
will work on reducing the SA by doing a cross-tier compaction on the two
largest tiers. This feature works only on the two largest tiers, because taking
into account others, could hurt the compaction efficiency which is based on
the fact that the more similar-sized sstables are compacted together the higher
the compaction efficiency will be.
With SAG enabled, min_threshold only plays an important role on the smallest
tiers, given that the second-largest tier could be compacted into the largest
tier for a space_amplification_goal value < 2.
By making the options space_amplification_goal and min_threshold independent,
user will be able to tune write amplification and space amplification, based on
the needs. The lower the space_amplification_goal the higher the write
amplification, but by increasing the min threshold, the write amplification
can be decreased to a desired amount.
"
commit 7d90911c5fb3fa891ad64a62147c3a6ca26d61b1
Author: Raphael S. Carvalho <raphaelsc@scylladb.com>
Date: Sat Oct 16 13:41:46 2021 -0300
compaction: ICS: Add garbage collection
Today, ICS lacks an approach to persist expired tombstones in a timely manner,
which is a problem because accumulation of tombstones are known to affecting
latency considerably.
For an expired tombstone to be purged, it has to reach the top of the LSM tree
and hope that older overlapping data wasn't introduced at the bottom.
The condition are there and must be satisfied to avoid data resurrection.
STCS, today, has an inefficient garbage collection approach because it only
picks a single sstable, which satisfies the tombstone density threshold and
file staleness. That's a problem because overlapping data either on same tier
or smaller tiers will prevent tombstones from being purged. Also, nothing is
done to push the tombstones to the top of the tree, for the conditions to be
eventually satisfied.
Due to incremental compaction, ICS can more easily have an effecient GC by
doing cross-tier compaction of relevant tiers.
The trigger will be file staleness and tombstone density, which threshold
values can be configured by tombstone_compaction_interval and
tombstone_threshold, respectively.
If ICS finds a tier which meets both conditions, then that tier and the
larger[1] *and* closest-in-size[2] tier will be compacted together.
[1]: A larger tier is picked because we want tombstones to eventually reach the
top of the tree.
[2]: It also has to be the closest-in-size tier as the smaller the size
difference the higher the efficiency of the compaction. We want to minimize
write amplification as much as possible.
The staleness condition is there to prevent the same file from being picked
over and over again in a short interval.
With this approach, ICS will be continuously working to purge garbage while
not hurting overall efficiency on a steady state, as same-tier compactions are
prioritized.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211016164146.38010-1-raphaelsc@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#22063
Previously in 8b7a5ca88d, we checked for combined_test existence
without the "build" component in the path. This caused the test
suite to never find the executable, preventing the test cases'
cache from being populated.
Changes:
1. Use path_to() to check executable existence, which:
- Includes the "build" component in path
- Handles both CMake and configure.py build paths
2. Move existence check out of _generate_cache() for clarity
This ensures combined_test and its included tests are properly
discovered and run.
Fixesscylladb/scylladb#22086
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Previously, we had inconsistent behavior around missing executables:
- 561e88f0 added early failure if any executable was missing
- 8b7a5ca8 added a partial skip for combined_test, but didn't properly
handle build paths and artifacts
This change:
1. Moves executable existence check to PythonTestSuite class
3. Eliminates redundant os.access() checks
This allows running tests with a partial build while properly handling
missing executables, particularly for the combined_test suite.
In a succeeding change, we will correct the check for combined_tests.
Refs scylladb/scylladb#19489
Refs scylladb/scylladb#22086
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
- utils: phased_barrier: advance_and_await: allocate new gate only when needed
- utils: phased_barrier: add close() method
- and use in existing services
* Improvement. No backport needed
Closesscylladb/scylladb#22018
* github.com:scylladb/scylladb:
utils: phased_barrier: add close() method
utils: phased_barrier: advance_and_await: allocate new gate only when needed
function.
The `data_read_resolver` class inherits from `abstract_read_resolver`, which already includes the
`schema_ptr _schema` member. Therefore, using a separate function parameter in `data_read_resolver::resolve`
initialized with the same variable in `abstract_read_executor` is redundant.
diff calculation hashmap.
This update addresses an issue in the mutation diff calculation algorithm used during read repair.
Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on
the Murmur3 hash function, it could generate duplicate values for different partition keys, causing
corruption in the affected rows' values.
Fixesscylladb/scylladb#19101
conflicting keys.
The test updates two rows with keys that result in a Murmur3 hash collision, which
is used to generate Scylla tokens. These tokens are involved in read repair diff
calculations. Due to the identical token values, a hash map key collision occurs.
Consequently, an incorrect value from the second row (with a different primary key)
is then sent for writing as 'repaired', causing data corruption.
This series introduces workload prioritization: an extension of the service levels feature which allows specifying "shares" per service level. The number of shares determines the priority of the user which has this service level attached (if multiple are attached then the one with the lowest shares wins).
Different service levels will be isolated in the following way:
- Each service level gets its own scheduling group with the number of shares (corresponding to the service level's number of shares), which controls the priority of the CPU and I/O used for user operations running on that service level.
- Each service level gets two reader concurrency semaphores, one for user reads and the other for read-before-write done for view updates.
- Each service level gets its own TCP connections for RPC to prevent priority inversion issues.
Because of the mandatory use of scheduling groups, which are a globally limited resource, the number of service levels is now limited to 7 user created service levels + 1 created by default that cannot be removed.
This feature has been previously only available in ScyllaDB Enterprise but has been made available for the source available ScyllaDB. The series was created by comparing the master branch with source-available-workbranch / enterprise branch and taking the workload prioritization related parts from the diff, then molding the resulting diff into a proper series. Some very minor changes were made such as fixing whitespace, removing unused or unnecessary code, adding some boilerplate (in api/) which was missing, but otherwise no major changes have been made.
No backport is required.
Closesscylladb/scylladb#22031
* github.com:scylladb/scylladb:
tracing: record scheduling group in trace event record
qos: un-shared-from-this standard_service_level_distributed_data_accessor
alternator: execute under scheduling group for service level
test.py: support multiple commands in prepare_cql in suite.yml
docs: add documentation for workload prioritization
docs/dev: describe workload prioritization features in service_levels
test/auth_cluster: test workload prioritization in service level tests
cqlpy/test_service_levels: add workload prioritization tests
api: introduce service levels specific API
api/cql_server_test: add information about scheduling group
db/virtual_tables: add scheduling group column to system.clients
test/boost: update service_level_controller_test for workload prio
qos: include number of shares in DESCRIBE
cql3/statements: update SL statements for workload prioritization
transport/server: use scheduling group assigned to current user
messaging_service: use separate set of connections per service levels
replica/database: add reader concurrency semaphore groups
qos: manage and assign scheduling groups to service levels
qos: use the shares field in service level reads/writes
qos: add shares to service_level_options
qos: explicitly specify columns when querying service level tables
db/system_distributed_keyspace: add shares column and upgrade code
db/system_keyspace: adjust SL schema for workload prioritization
gms: introduce WORKLOAD_PRIORITIZATION cluster feature
build: increase the max number of scheduling groups
qos: return correct error code when SL does not exist
Previously fatal errors like missing Minio executable were logged at INFO level,
which could be filtered out by log settings. Switch to ERROR level to ensure
these critical issues are always visible to developers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22084
Currently it should not happen because gossiper shadow round does
similar check, but we want to drop states that propagate through raft
from the gossiper eventually.
we updated tools/java/build.xml recently to only build for java-11. so
if
- the `java` executable in `$PATH` points to a java which is neither
java-8 nor java-11.
- java-8 is installed
java-8 is used to execute the cassandra-stress tool. and we would have
following failure:
```
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/cassandra/stress/Stress has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recogniz
es class file versions up to 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:621)
```
in order to be compatible with the bytecode targeting java-11, let's run
cassandra-stress with java-11. we do not need to support java-8, because
the new tools/java is now building cassandra-stress targeting java-11 jre.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22142
attribute server_broken_reason into the server was introduced, to store the raw information
regarding why the server was broken
additional information was added in the error messages in case of "server
broken"
fixes: #21630Closesscylladb/scylladb#22074
This string conversion functions are not in any fast path. Deinlining
them moves a <boost/lexical_cast.hpp> include out of a common header file.
Some files accessed on boost::iterator_range via lexical_cast.hpp,
so they gain a new dependency.
Closesscylladb/scylladb#21950
We have a "thread" field (unfortunately not yet displayed
in cqlsh, but visible in the table) that records the shard
on which a particular event was recorded. Record the scheduling
group as well, as this can be useful to understand where the
query came from.
(cherry picked from commit 3c03b5f66376dca230868e54148ad1c6a1ad0ee2)
Update `test_connections_parameters_auto_update` to also check that the
scheduling group of given connections is appropriately changed when a
different service level is assigned to the user that the connection uses
for authentication.
Apart from that, more tests are added:
- Check for the logic that forbids setting shares for a service level
until all nodes in the cluster are upgraded
- Test for handling the case when there are more scheduling groups than
it is allowed (it might happen after upgrade from a non-workload-prio
version)
- Regression test for a bug where less scheduling groups could have been
created than allowed due to some metrics not being renamed on
scheduling group name change.
Adjust existing cqlpy tests and add more in order to test the workload
prioritization feature:
- The DESCRIBE test is updated to check that generated statements
contain information about shares
- Two tests for shares in the LIST EFFECTIVE SERVICE LEVEL statement
- Regression test which checks that we can create as many service levels
as promised in the documentation (currently 7), but no more
- Test which checks that NULL shares in the service levels table are
treated as the default 1000 shares
Introduces two endpoints with operations specific to service levels:
- switch_tenants: updates the scheduling group of all connections to be
aligned with the service level specific to the logged in user. This is
mostly legacy API, as with service levels on raft this is done
automatically.
- count_connections: for each user and for each scheduling group, counts
how many connections are assigned to that user and scheduling group.
This API is used in tests.
Adjust some of the existing tests in service_level_controller_test.cc
and add some more in order to test the workload prioritization features,
i.e. the service level shares.
Now, the CREATE statements generated for each service level by the
DESCRIBE SCHEMA WITH INTERNALS statement will account for the service
level's shares.
Introduce the "SHARES" keyword which can be used in conjunction with
existing CQL statements related to the service levels.
Adjust the CQL statements for service levels:
- CREATE/ALTER now allow to set shares (only if the cluster is fully
upgraded)
- LIST EFFECTIVE SERVICE LEVEL now return the number of shares in a new
column
- LIST SERVICE LEVEL(S) also return the number of shares, and has the
additional column "percentage of all service level shares"
Now, when the user logs in and the connection becomes authenticated, the
processing loop of the connection is switched to the scheduling group
that corresponds to the service level assigned to the logged in user.
The scheduling group is also updated when the service level assigned to
this user changes.
Starting from this commit, the scheduling groups managed by the service
level controller are actually being used by user workload.
In order to make sure that the scheduling group carries over RPC, and
also to prevent priority inversion issues between different service
levels, modify the messaging service to use separate RPC connections for
each service level in order to serve user traffic.
The above is achieved by reusing the existing concept of "tenants" in
messaging service: when a new service level (or, more accurately,
service-level specific scheduling group) is first used in an RPC, a
new tenant is created.
In addition, extend the service level controller to be able to quickly
look up the service level name of the currently active scheduling group
in order to speed up the logic for choosing the tenant.
Replace the reader concurrency semaphores for user reads and view
updates with the newly introduced reader concurrency semaphore group,
which assigns a semaphore for each service level.
Each group is statically assigned to some pool of memory on startup and
dynamically distribute this memory between the semaphores, relative to
the number of shares of the corresponding scheduling group.
The intent of having a separate reader concurrency semaphore for each
scheduling group is to prevent priority inversion issues due to reads
with different priorities waiting on the same semaphore, as well as make
memory allocation more fair between service levels due to the adjusted
number of shares.
Introduce the core logic of workload prioritization, responsible for
assigning scheduling groups to service levels.
The service level controller maintains a pool of scheduling groups for
the currently present service levels, as well as a pool of unused
scheduling groups which were previously used by some service level that
was deleted during node's lifetime.
When a new service level is created, the SL controller either assigns a
scheduling group from the unused SG pool, or creates a new one if the
pool is empty. The scheduling group is renamed to "sl:<scheduling group
name>".
When updating shares of a service level (and also when creating a new
service level), the shares of the corresponding scheduling group are
synchronized with those of the service level.
When a service level is deleted, its group is released to the
aforementioned pool of unused scheduling groups and the prefix of its
name is changed from "sl:" to "sl_deleted:".
For now, these scheduling groups are not used by any user operations.
This will be changed in subsequent commits.
Add service level shares related fields to service_level_options and
slo_effective_names structs, and adjust the existing methods of the
former (merge_with, init_effective_names) to account for them.
The service levels table is queried with a `SELECT * ...` query, by
using the `execute_internal` method which prepares and caches the query
in an special cache for internal queries, separate from the user query
cache.
During rolling upgrade from a version which does not support service
level shares to the one that does, the `shares` column is added. The
aforementioned internal query cache is _not_ invalidated on schema
change, so the cache might still contain the prepared query from the
time before the column was added, and that prepared query will fetch the
old set of column without the new `shares` column.
In order to solve this, explicitly specify the columns in the query
string, using the full set of column names from the time when the query
is executed.
Note that this is a problem only for the legacy, non-raft service
levels. Raft-based service levels use a local table for which the schema
is determined on startup.
Also note that this code only fetches values from the `shares` column
but does not make any use of it otherwise. It will be handled by later
commits in this series.
Add the "shares" column to the
system_distributed_keyspace.service_levels table, which is used by
legacy code.
Because this table is in a distributed and not local keyspace, adding
the column to an existing cluster during rolling upgrade requires a bit
of care. A callback is added to the workload prioritization cluster
feature which runs when the feature becomes enabled and adds the column
for all nodes in the cluster.
Add a "shares" column which hold the number of shares allocated to
given service level.
It is not used by the code at all right now, subsequent commits will
make good use of it.
Information about the number of shares per service level will be stored
in an additional column in the service levels table, which is managed
through group0. We will need the feature to make sure that all nodes in
the cluster know about the new column before any node starts applying
group0 commands the would touch the new column.
This feature also serves a role for the legacy service levels
implementation that uses system_distributed for storage: after all nodes
are upgraded to support workload prioritization, one of the nodes will
perform a schema change operation and will add the new column.
Workload prioritization assigns scheduling groups to service levels, and
the number of scheduling groups that can exist at the same time is
limited with a compile-time parameter in seastar. The documentation for
workload prioritization says that we currently support 7 user-managed
service levels and 1 created by default. Increase the current
compile-time limit in order to align with the documentation.
The `nonexistant_service_level_exception` can be thrown by service
levels code and propagated up to the CQL server layer, where it is
converted into a CQL protocol error. The aforementioned exception
inherits from `service_level_argument_exception`, which in turn inherits
from `std::invalid_argument` - which doesn't mean much to the CQL layer
and is converted to a generic SERVER_ERROR.
We can do better and return a more meaningful error code for this
exception. Change the base class of service_level_argument_exception to
exceptions::invalid_request_exception which gets converted to an INVALID
error.
The INVALID error code was already being used by the enterprise version,
so this commit just synchronizes error handling with enterprise.
This adds to the grammar the option to SELECT a specific element in a collection (map/set/list).
For example:
`SELECT map['key'] FROM table`
`SELECT map['key1']['key2'] FROM table`
This feature was implemented in Cassandra 4.0 and was requested by scylla users.
The behavior is mostly compatible with Cassandra, except:
1. in SELECT, we allow list subscript in a selector, while cassandra allows only map and set.
2. in UPDATE, we allow set subscript in a column condition, while cassandra allows only map and list.
3. the slice syntax `SELECT m[a..b]` is not implemented yet
4. null subscript - `SELECT m[null]` returns null in scylla, while cassandra returns error
Fixes#7751
backport was requested for a user to be able to use it
Closesscylladb/scylladb#22051
* github.com:scylladb/scylladb:
cql3: allow SELECT of specific collection key
cql3: allow set subscript
The test no longer test anything since the address map is updated much
earlier now by the gossiper itself, not by the notifiers. The
functionality is tested by a unit test now.
Although the `network_topology_stratergy::make_replication_map` ->
`tablet_aware_replication_strategy::do_make_replication_map`
is not cpu intensive it still allocates and constructs a shared
`tablet_effective_replication_map`, and that might stall with
thousands of tablet-based tables.
Therefore coroutinize the preparation loop to allow yielding.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We already ignore a gossiper entries with host id equal to local host id
in raft mode since those entries are just outdated entries since before
ip change. The same logic applies to gossiper mode as well though, so do
the same in both modes.
Fixes: scylladb/scylladb#21930
Message-ID: <Z20kBZvpJ1fP9WyJ@scylladb.com>
This is a forward port (from scylla-enterprise) of additional compression options (zstd, dictionaries shared across messages) for inter-node network traffic. It works as follows:
After the patch, messaging_service (Scylla's interface for all inter-node communication)
compresses its network traffic with compressors managed by
the new advanced_rpc_compression::tracker. Those compressors compress with lz4,
but can also be configured to use zstd as long as a CPU usage limit isn't crossed.
A precomputed compression dictionary can be fed to the tracker. Each connection
handled by the tracker will then start a negotiation with the other end to switch
to this dictionary, and when it succeeds, the connection will start being compressed using that dictionary.
All traffic going through the tracker is passed as a single merged "stream" through dict_sampler.
dictionary_service has access to the dict_sampler.
On chosen nodes (in the "usual" configuration: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the alien_worker thread) and publishes the new dictionary
to system.dicts via Raft's write_mutation command.
This update triggers (eventually) a callback on all nodes, which feeds the new dictionary
to advanced_rpc_compression::tracker, and this switches (eventually) all inter-node connections
to this dictionary.
Closesscylladb/scylladb#22032
* github.com:scylladb/scylladb:
messaging_service: use advanced_rpc_compression::tracker for compression
message/dictionary_service: introduce dictionary_service
service: make Raft group 0 aware of system.dicts
db/system_keyspace: add system.dicts
utils: add advanced_rpc_compressor
utils: add dict_trainer
utils: introduce reservoir_sampling
utils: introduce alien_worker
utils: add stream_compressor
Logging randomization parameters in the pytest_generate_tests hook doesn't
play well for us. To make these parameters more visible move the logging
to the test level.
Closesscylladb/scylladb#22055
This changeset ports LTO and PGO support from scylla-enterprise.git to scylladb.git.
Add support for Link-Time Optimization (LTO) and Profile-Guided Optimization (PGO)
to improve performance. LTO provides ~7% performance gain and enables crucial
binary layout optimizations for PGO.
LTO Changes:
- Add `-flto` flag to compile and link steps
- Use `-ffat-lto-objects` to generate both LLVM IR and machine code
- Enable cross-object optimization while maintaining fast test linking
PGO Implementation:
- Implement three-stage build process:
1. Context-free profiling (`-fprofile-generate`)
2. Context-sensitive profiling (`-fprofile-use` + `-fcs-profile-generate`)
3. Final optimization using merged profiles
- Add release-pgo and release-cs-pgo build stages
- Integrate with ninja build system
- Stages can be enabled independently
Profile Management:
- Add `pgo/pgo.py` for workload profile collection
- Store default profile in `pgo/profiles/profile.profdata.xz` using Git LFS
- Add configure.py integration for profile detection and validation
- Support custom profiles via `--use-profile` flag
- Add profile regeneration script
Both optimizations are recommended for maximum performance, though each PGO
stage adds a full build cycle. Future optimization may allow dropping one
PGO stage if performance impact is minimal.
---
this is a forward port, hence no need to backport.
Closesscylladb/scylladb#22039
* github.com:scylladb/scylladb:
build: cmake: add CMake options for PGO support
build: cmake: add "Scylla_ENABLE_LTO" option
build: set LTO and PGO flags for Seastar in cmake build
build: collect scylla libraries with `scylla_libs` variable
build: Unify Abseil CXX flags configuration
configure.py: prepare the build for a default PGO profile in version control
configure.py: introduce profile-guided optimization
pgo: add alternator workloads training
pgo: add a repair workload
pgo: add a counters workload
pgo: add a secondary index workload
pgo: add a LWT workload
pgo: add a decommission workload
pgo: add a clustering workload
pgo: add a basic workload
pgo: introduce a PGO training script
configure.py: don't include non-default modes in dist-server-* rules
configure.py: enable LTO in release builds by default
configure.py: introduce link-time optimization
configure.py: add a `default` to `add_tristate`.
configure.py: unify build rules for cxxbridge .cc files and regular .cc files
Commit f2ff701489 introduced
a yield in update_effective_replication_map that might
cause the storage_group manager to be inconsistent with the
new effective_replication_map (e.g. if yielding right
before calling `handle_tablet_split_completion`.
Also, yielding inside storage_service::replicate_to_all_cores
update loop means that base tables and their views
aren't updated atomically, that caused scylladb/scylladb#17786
This change essentially reverts f2ff701489
and makes handle_tablet_split_completion synchronous too.
The stopped compaction groups future is kept as a member and
storage_group_manager::stop() consumes this future during table::stop().
- storage_service: replicate_to_all_cores: update base and view tables atomically
Currently, the loop updating all tables (including views) with the
new effective_replication_map may yield, and therefore expose
a state where the base and view tables effective_replication_map
and topology are out of sync (as seen in scylladb/scylladb#17786)
To prevent that, loop over all base tables and for each table
update the base table and all views atomically, without yielding,
and so allow yielding only between base tables.
* Regression was introduced in f2ff701489, so backport is required to 6.x, 2024.2
Closesscylladb/scylladb#21781
* github.com:scylladb/scylladb:
storage_service: replicate_to_all_cores: clear_gently pending erms
test_mv_topology_change: drop delay_after_erm_update injection case
storage_service: replicate_to_all_cores: update base and view tables atomically
table: make update_effective_replication_map sync again
This adds to the grammar the option to SELECT a specific key in a
collection column using subscript syntax.
For example:
SELECT map['key'] FROM table
SELECT map['key1']['key2'] FROM table
The key can also be parameterized in a prepared query. For this we need
to pass the query options to result_set_builder where we process the
selectors.
Fixesscylladb/scylladb#7751
In this patch we test the behavior of schema registry in a few
scenarios where it was identified it could misbehave.
The first one is reverse schemas for views. Previously, SELECT
queries with reverse order on views could fail because we didn't
have base info in the registry for such schemas.
The second one is schemas that temporarily died in the registry.
This can happen when, while processing a query for a given schema
version, all related schema_ptrs were destroyed, but this schema
was requested before schema_registry::grace_period() has passed.
In this scenario, the base info would not be recovered, causing
errors.
After the previous patches, the view schemas returned by schema registry
always have their base info set. As such, we no longer need to set it after
getting the view schema from the registry. This patch removes these
unnecessary updates.
The schema registry now holds base schemas for view schemas.
The base schema may change without changing the view schema, so to
preserve the change in the schema registry, we also update the
base schema in the registry when updating the base info in the
view schema.
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated.
To fix this, this patch adds the base schema to the registry, alongside
the view schema. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.
To store the base schema, the loader methods now have to return the base
schema alongside the view schema. At the same time, when loading into
the registry, we need to check whether we're loading a view schema, and if
so, we need to also provide the base schema. When inserting a regular table
schema, the base schema should be a disengaged optional.
In the following patches, we'll assure that view schemas returned by the
schema registry always have base info set. To prepare for that, make sure
that the base info is always set before inserting it into schema registry,
test.py: only access combined_tests executable if it is built
Fixes#22038Closesscylladb/scylladb#22069
* github.com:scylladb/scylladb:
test.py: only access combined_tests if it exists
test.py: rethrow CancelledError when executing a test
If authentication is enabled, but STARTUP isn't followed by REGISTER (which is optional, and in practice only happens on only one of a driver's connections — because there's no point listening for the same events on multiple connections), connections are wrongly displayed in the system.clients as AUTHENTICATING instead of READY, even when they are ready.
This commit fixes this problem.
Fixes: scylladb/scylladb#12640Closesscylladb/scylladb#21774
`record_property` generates XML which is not compatible with xunit2,
so pytest decided to deprecated when the generating xunit reports.
and pytest generates following warning when a test failure is
reported using this fixture:
```
object_store/test_backup.py:337: PytestWarning: record_property is incompatible with junit_family 'xunit2' (use 'legacy' or 'xunit1')
```
this warning is not related to the test, but more about how we
report a failure using pytrest. it is distracting, so let's silence it.
See also https://github.com/pytest-dev/pytest/issues/5202
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22067
There are many CI failures (repros of https://github.com/scylladb/scylladb/issues/21534) which caused by `stop_after_setting_mode_to_normal_raft_topology` and `stop_before_becoming_raft_voter` error injections in combination with some cluster events.
Need to deselect them for now to make CI more stable. First batch deselected in https://github.com/scylladb/scylladb/pull/21658
Also, add the handling of topology state rollback caused by `stop_before_streaming` or `stop_after_updating_cdc_generation` error injections as a separate commit.
See also https://github.com/scylladb/scylladb/issues/21872 and https://github.com/scylladb/scylladb/issues/21957
Closes scylladb/scylladb#22044
* github.com:scylladb/scylladb:
test.py: topology_random_failures: more deselects for #21534
test.py: topology_random_failures: handle more node's hangs during 30s sleep
This allows to use subscript on a set column, in addition to map/list
which was possible until now.
The behavior is compatible with Cassandra - a subscript with a specific value
returns the value if it's found in the set, and null otherwise.
When the scylla source tree is only partially built,
we still may want to run the tests.
test.py builds a case cache at boot, and executes
--list-cases for that, for all built tests.
After amalgamating boost unit tests into a single
file, it started running it unconditionally, which broke
partial builds.
Hence, only use combined_tests executable if it exists.
Fixes#22038
Commit 870f3b00fc,
"Add option to fail after number of failures" adds
tracking on the number of cancelled tests.
For the purpose, it intercepts CancelledError
and sets test's is_cancelled flag.
This introduced a regression reported in gh-21636:
Ctrl-C no longer works, since CancelledError is muted.
There was no intent to mute the exception,
re-throw it after accounting the test as cancelled.
This patch sets up an `alien_worker`, `advanced_rpc_compression::tracker`,
`dict_sampler` and `dictionary_service` in `main()`, and wires them to each other
and to `messaging_service`.
`messaging_service` compresses its network traffic with compressors managed by
the `advanced_rpc_compression::tracker`. All this traffic is passed as a single
merged "stream" through `dict_sampler`.
`dictionary_service` has access to `dict_sampler`.
On chosen nodes (by default: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the `alien_worker` thread) and publishes the new dictionary
to `system.dicts` via Raft.
This update triggers a callback into `advanced_rpc_compression::tracker` on all nodes,
which updates the dictionary used by the compressors it manages.
- "Scylla_BUILD_INSTRUMENTED" option
Scylla_BUILD_INSTRUMENTED allows us to instrument the code at
different level, namely, IR, and CSIR. this option mirrors
"--pgo" and "--cspgo" options in `configure.py` . please note,
the instrumentation at the frontend is not supported, as the IR
based instrumentation is better when it comes to the use case of
optimization for performance.
see https://lists.llvm.org/pipermail/llvm-dev/2015-August/089044.html
for the rationales.
- "Scylla_PROFDATA_FILE" option
this option allows us to specify the profile data previous generated
with the "Scylla_BUILD_INSTRUMENTED" option. this option mirrors
the `--use-profile` option in `configure.py`, but it does not
take the empty option as a special case and consider it as a file
fetched from Git LFS. that will be handled by another option in a
follow-up change. please note, one cannot use
-DScylla_BUILD_INSTRUMENTED=PGO and -DScylla_PROFDATA_FILE=...
at the same time. clang just does not allow this. but CSPGO is fine.
- "Scylla_PROFDATA_COMPRESSED_FILE" option
this option allows us to specify the compressed profile data previouly
generated with the "Scylla_BUILD_INSTRUMENTED" option. along with
"Scylla_PROFDATA_FILE", this option mirros the functionality of
`--use-profile` in `configure.py`. the goal is to ensure user always
gets the result with the specified options. if anything goes wrong,
we just error out.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
add an option named "Scylla_ENABLE_LTO", which is off by default.
if it is on, build the whole tree with ThinLTO enabled.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This change extends scylla commit 7cb74df to scylla-enterprise-commit
4ece7e1.
we recently started building Seastar as an external project, so
we need to prepare its compilation flags separately. in enterprise
scylla, we prepare the LTO and PGO related cflags in
`prepare_advanced_optimizations()`. this function is called when
preparing the build rules directly from `configure.py`, and despite
we have equivalant settings in CMake, they cannot be applied to Seastar
due to the reason above.
in this change, we set up the the LTO and PGO compilation flags when
generating the buiding system for Seastar when building using CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
- Set ABSL_GCC_FLAGS and ABSL_LLVM_FLAGS with a more generic absl_cxx_flags
- Enables more flexible configuration of compiler flags for Abseil libraries
- Provides a centralized approach to setting compilation flags
Previously, sanitizer-specific flags were directly applied to Abseil library builds.
This change allows for more extensible compiling flag management across
different build configurations.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This patch adds the following logic to the release build:
pgo/profiles/profile.profdata.xz is the default profile file, compressed.
This file is stored in version control using git LFS.
A ninja rule is added which creates build/profile.profdata by decompressing it.
If no profile file is explicitly specified, ./configure.py checks whether
the compressed default profile file exists and is compressed.
(If it exists, but isn't compressed, the user most likely has
git lfs disabled or not installed. In this case, the file visible in the working
tree will be the LFS placeholder text file describing the LFS metadata.)
If the compressed file exists, build/profile.profdata is chosen as the used
profile file.
If it doesn't exist, a warning is printed and configure.py falls back
to a profileless build.
The default profile file can be explicitly disabled by passing the empty
--use-profile="" to configure.py
A script is added which re-generates the profile.
After the script is run, the re-generated compressed profile can be staged,
committed, pushed and merged to update the default profile.
This commit enables profile-guided optimizations (PGO) in the Scylla build.
A full LLVM PGO requires 3 builds:
1. With -fprofile-generate to generate context-free (pre-inlining) profile. This
profile influences inlining, indirect-call promotion and call graph
simplifications.
2. With -fprofile-use=results_of_build_1 -fcs-profile-generate to generate
context-sensitive (post-inlining) profile. This profile influences post-inline
and codegen optimizations.
3. With -fprofile-use=merged_results_of_builds_1_2 to build the final binary
with both profiles.
We do all three in one ninja call by adding release-pgo and release-cs-pgo
"stages" to release. They are a copy of regular release mode, just with the
flags described above added. With the full course, release objects depend on the
profile file produced by build/release-cs-pgo/scylla, while release-cs-pgo
depends on the profile file generated by build/release-pgo/scylla.
The stages are orthogonal and enabled with separate options. It's recommended
to run them both for full performance, but unfortunately each one adds a full
build of scylla to the compile time, so maybe we can drop one of them in the
future if it turns out e.g. that regular PGO doesn't have a big effect.
It's strongly recommended to combine PGO with LTO. The latter enables the entire
class of binary layout optimizations, which for us is probably the most
important part of the entire thing.
This patch adds a set of alternator workloads to pgo training
script.
To confirm that added workloads are indeed affecting profile we can compare:
⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/clustering/prof.profdata
Instrumentation level: IR entry_first = 0
Total functions: 105075
Maximum function count: 1079870885
Maximum internal block count: 2197851358
and
⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/alternator/prof.profdata
Instrumentation level: IR entry_first = 0
Total functions: 105075
Maximum function count: 5240506052
Maximum internal block count: 9112894084
to see that function counters are on similar levels, they are around 5x higher for alternator
but that's because it combines 5 specific sub-workloads.
To confirm that final profile contains alterantor functions we can inspect:
⤖ llvm-profdata show --counts --function=alternator --value-cutoff 100000 ./build/release-pgo/profiles/merged.profdata
(...)
Instrumentation level: IR entry_first = 0
Functions shown: 356
Total functions: 105075
Number of functions with maximum count (< 100000): 97275
Number of functions with maximum count (>= 100000): 7800
Maximum function count: 7248370728
Maximum internal block count: 13722347326
we can see that 356 functions which symbol name contains word alternator were identified as 'hot' (with max count grater than 100'000). Running:
⤖ llvm-profdata show --counts --function=alternator --value-cutoff 1 ./build/release-pgo/profiles/merged.profdata
(...)
Instrumentation level: IR entry_first = 0
Functions shown: 806
Total functions: 105075
Number of functions with maximum count (< 1): 67036
Number of functions with maximum count (>= 1): 38039
Maximum function count: 7248370728
Maximum internal block count: 13722347326
we can see that 806 alternator functions were executed at least once during training.
And finally to confirm that alternator specific PGO brings any speedups we run:
for workload in read scan write write_gsi write_rmw
do
./build/release/scylla perf-alternator-workloads --smp 4 --cpuset "10,12,14,16" --workload $workload --duration 1 --remote-host 127.0.0.1 2> /dev/null | grep median
done
results BEFORE:
median 258137.51910849303
median absolute deviation: 786.06
median 547.2578202937141
median absolute deviation: 6.33
median 145718.19856685458
median absolute deviation: 5689.79
median 89024.67095807113
median absolute deviation: 1302.56
median 43708.101729598646
median absolute deviation: 294.47
results AFTER:
median 303968.55333940056
median absolute deviation: 1152.19
median 622.4757636209254
median absolute deviation: 8.42
median 198566.0403745328
median absolute deviation: 1689.96
median 91696.44912842038
median absolute deviation: 1891.84
median 51445.356525664996
median absolute deviation: 1780.15
We can see that single node cluster tps increase is typically 13% - 17% with notable exceptions,
improvement for write_gsi is 3% and for write workload whopping 36%.
The increase is on top of CQL PGO.
Write workload is executed more often because it's involved also as data preparation for read and scan.
Some further improvement could be to separate preparation from training as it's done for CQL but it would
be a bit odd if ~3x higher counters for one flow have so big impact.
Additional disclaimers:
- tests are performing exactly the same workloads as in training so there might be some bias
- tests are running single node cluster, more realistic setup will likely show lower improvement
Fixes https://github.com/scylladb/scylla-enterprise/issues/4066
This workload is added to teach PGO about repair.
Tests are inconclusive about its alignment with existing workloads,
because repair doesn't seem utilize 100% of the reactor.
This workload is added to teach PGO about counters.
Tests seem to show it's mostly aligned with existing CQL workloads.
The config YAML is based on the default cassandra-stress schema.
This workload is added to teach PGO about secondary indexes.
Tests seem to show that it's mostly aligned with existing CQL workloads.
The config YAML was copied from one of scylla-cluster-test test cases.
This workload is added to teach PGO about LWT codepaths.
Tests seem to show that it's mostly aligned with existing CQL workloads.
The config YAML was copied from one of scylla-cluster-tests test cases.
This workload is added to teach PGO about streaming.
Tests show that this workload is mostly orthogonal to CQL workloads
(where "orthogonal" means that training on workload A doesn't improve workload
B much, while training on workload A doesn't improve workload B much),
so adding it to the training is quite important.
In contrast to the basic workload, this workload uses clustering
keys, CK range queries, RF=1, logged batches, and more CQL types.
Tests seem to show that this workload is mostly aligned with the existing basic
workload (where "aligned" means that training on workload A improves workload B
about as much as training on workload B).
The config YAML is based on the example YAML attached to cassandra-stress
sources.
Profile-guided optimization consists of the following steps:
1. Build the program as usual, but with with special options (instrumentation
or just some supplementary info tables, depending on the exact flavor of PGO
in use).
2. Collect an execution profile from the special binary by running a
training workload on it.
3. Rebuild the program again, using the collected profile.
This commit introduces a script automating step 2: running PGO training workloads
on Scylla. The contents of training workloads will be added in future commits.
The changes in configure.py responsible for steps 1. and 3. will also appear
in future commits.
As input, the script takes a path to the instrumented binary, a path to a
the output file, and a directory with (optionally) prepopulated datasets for use
in training. The output profile file can be then passed to the compiler to
perform a PGO build.
The script current supports two kinds of PGO instrumentation: LLVM instrumentation
(binary instrumented with -fprofile-generate and -fcs-profile-generate passed to
clang during compilation) and BOLT instrumentation (binary instrumented with
`llvm-bolt -instrument`, with logs from this operation saved to
$binary_path.boltlog)
The actual training workloads for generating the profile will be added in later
commits.
This patch introduces link-time optimization (LTO) to the build.
The performance gains from LTO alone are modest (~7%), but it's vital ingredient
of effective profile-guided optimization, which will be introduced later.
In general, use of LTO is quite simple and transparent to build systems.
It is sufficient to add the -flto flag to compile and link steps, and use a
LTO-aware linker.
At compile time, -ffat-lto-objects will cause the compiler to emit .o
files both LTO-ready LLVM IR for main executable optimization and machine
code for fast test linking. At link time, those pieces of IR will be
compiled together, allowing cross-object optimization of the main
executable and the fast linking of test executables.
Due to it's high compile time cost, the optimization can be toggled with a
configure.py option. As of this patch, it's disabled by default.
We know the number of positions in advance
so reserve the chunked_vector capacity for that.
Note: reservation replaces the existing reset of the
positions member. This is safe since we parse the summary
only once as sstable::read_summary() returns early
if the summary component is already populated.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#21767
Until today, when we had a PR with multiple commits we cherry-pick the merge commit only, which created a PR with only one commit (the merge commit) with all relevant changes
This was causing an issue when there was a need to backport part of the commits like in https://github.com/scylladb/scylladb/pull/21990 (reported by @gleb-cloudius)
Changing the logic to cherry-pick each commit
Closesscylladb/scylladb#22027
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::to`.
in this change, we:
- replace `boost::copy_range` to `std::ranges::to`
- remove unused `#include` of boost headers
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21880
When embedding HTML documents in pytest reports with links to test artifacts,
parameterized test names containing special characters like "[" and "]" can
cause URL encoding issues. These characters, when used verbatim in URLs, can
trigger HTTP 400 errors on web servers.
This commit resolves the issue by percent-encoding the URLs for artifact links,
ensuring compatibility with servers like Jenkins and preventing "HTTP ERROR 400
Illegal Path Character" errors.
Changes:
- Percent-encode test artifact URLs to handle special characters
- Improve link robustness for parameterized test names
Fixesscylladb/scylla-pkg#4599
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21963
When services are stopped we generally want to call
advance_and_await(), but we should also prevent starting
new operations, so close() would do that be closing the
phased_barrier active gate (which implicitly also awaits
past operations similar to advance_and_await()).
Add unit tests for that and use in existing services.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If there are no opearions in progress, there is no
need to close the current gate and allocate a new one.
The current gate can be reused for the new phase just as well.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This enables tablets in topology_custom, so explicitly
disable them where tests don't support tablets.
In scope of this rename patch a few imports.
Importing dependencies from another test is a bad idea -
please use shared libraries instead.
Fixed#20193Closesscylladb/scylladb#22014
In 2e6755ecca I have added a comment when PR has conflicts so the assignee can get a notification about it. There was a problem with the user mention param (a missing `.login`)
Fixing it
Closesscylladb/scylladb#22036
The main purpose of this change is to enhance the restore from object storage usage.
Currently, restore uses the load-and-stream facility. When triggered, the restoring task opens the provided list of sstables directory from the remote bucket and then feeds the list of sstables to load_and_stream() method. The method, in turn, iterates over this list, reads mutations and for each mutation decides where to send one by checking the replication map (it's pretty much the same for both vnodes and tablets, but for tablets that are "fully contained" by a range there's the plan to stream faster).
As described above, restore is governed by a single node and this single node reads all sstables from the object store, which can be very slow. This PR allows speeding things up. For that, the load-and-stream code is equipped with the "scope" filter which limits where mutations can be streamed to. There are four options for that -- all, dc, rack and node. The "all" is how things work currently, "dc" and "rack" filter out target nodes that don't belong to this node's dc/rack respectively. The "node" scope only streams mutations to local node.
With the "node" scope it's possible to make all nodes in the cluster load mutations that belong to them in parallel, without re-sending them to peers. The last patch in this PR is the test that shows how it can be possible.
Closesscylladb/scylladb#21169
* github.com:scylladb/scylladb:
test: Add scope-streaming test (for restore from backup)
api: New "scope" API param to load-and-stream calls
sstables_loader: Propagate scope from API down
sstables_loader: Filter tablets based on scope
streamer: Disable scoped streaming of primary replica only
sstables_loader: Introduce streaming scope
sstables_loader: Wrap get_endpoints()
It turned out that aforementioned APIs use slightly different sources of information about view build progress/status which sometimes results in different reporting of whether an index is built. It's good to make those two APIs consistent. Also add a test for the REST API endpoint (system table test was addressed by #21677).
Closesscylladb/scylladb#21814
* github.com:scylladb/scylladb:
test: Add tests for MVs and indexes reporting by API endpoint(s)
api: Use built_views table in get_built_indexes API
The node is hanging and the coordinator just rollback a topology state. It's different from
`stop_after_sending_join_node_request` and `stop_after_bootstrapping_initial_raft_configuration`
because in these cases the coordinator just not able to start the topology change at all and
a message in the coordinator's log is different.
Error injections handled:
- `stop_after_updating_cdc_generation`
- `stop_before_streaming`
And, actually, it can be any cluster event which lasts more than 30s.
This series re-implements locator::topology::sort_by_proximity
and adds some randomization to shuffle equal-distance replicas for improving load-balancing
when reading with 1 < consistency level < replication factor.
This change also adds a manual test for benchmarking sort_by_proximity,
as it's not exercised by the single-node perf-simple-query.
The benchmark shows performance improvement of over 20% (from about 71 ns to 56 ns
per call for 3 nodes vectors), mainly due to "calculate distance only once" which
pre-calculates the distance from the reference node for each replica once, rather than
each time to comparator is called by std::sort
* Improvement. No backport needed
Closesscylladb/scylladb#21958
* github.com:scylladb/scylladb:
locator/topology: do_sort_by_proximity: shuffle equal-distance replicas
locator/topology: sort_by_proximity: calculate distance only once
utils: small_vector: expose internal_capacity()
storage_proxy: sort_endpoints_by_proximity: lookup my_id only if cannot sort by proximity
test/perf: add perf_sort_by_proximity benchmark
locator: refactor sort_by_proximity
So far there's the /column_family/built_indexes one that reports the
index names similar to how system.IndexInfo does, but it's not tested.
This patch adds tests next to existing system. table ones.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Somehow system."IndexInfo" table and column_family/built_indexes REST
API endpoint declare an index "built" at slightly different times:
The former a virtual table which declares an index completely built
when it appears on the system.built_views table.
The latter uses different data -- it takes the list of indexes in
the schema and eliminates indexes which are still listed in the
system.scylla_views_builds_in_progress table.
The mentioned system. tables are updated at different times, so API
notices the change a bit later. It's worth improving the consistency
of these two APIs by making the REST API endpoint piggy-back the
load_built_views() instead of load_view_build_progress(). With that
change the filtering of indexes should be negated.
Fixes#21587
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
To improve balancing when reading in 1 < CL < ALL
This implementation has a moderate impact on
the function performance in contrast to full
std::shuffle of the vector before stable_sort:ing it
(especially with large number of nodes to sort).
Before:
test iterations median mad min max allocs tasks inst cycles
sort_by_proximity_topology.perf_sort_by_proximity 25541973 39.225ns 0.114ns 38.966ns 39.339ns 0.000 0.000 588.5 116.6
After:
sort_by_proximity_topology.perf_sort_by_proximity 19689561 50.195ns 0.119ns 50.076ns 51.145ns 0.000 0.000 622.5 150.6
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use a temporary vector to use the precalculated distances.
A later patch will add some randomization to shuffle nodes
at the same distance from the reference node.
This improves the function performance by 50% for 3 replicas,
from 77.4 ns to 39.2 ns, larger replica sets show greater improvement
(over 4X for 15 nodes):
Before:
test iterations median mad min max allocs tasks inst cycles
sort_by_proximity_topology.perf_sort_by_proximity 12808773 77.368ns 0.062ns 77.300ns 77.873ns 0.000 0.000 1194.2 231.6
After:
sort_by_proximity_topology.perf_sort_by_proximity 25541973 39.225ns 0.114ns 38.966ns 39.339ns 0.000 0.000 588.5 116.6
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So we can use it for defining other small_vector
deriving their internal capacity from another small_vector
type.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
topology::sort_by_proximity already sorts the local node
address first, if present, so look it up only when
using SimpleSnitch, where sort_by_proximity() is a no-op.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
benchmark sort_by_proximity
Baseline results on my desktop for sorting 3 nodes:
single run iterations: 0
single run duration: 1.000s
number of runs: 5
number of cores: 1
random seed: 20241224
test iterations median mad min max allocs tasks inst cycles
sort_by_proximity_topology.perf_sort_by_proximity 12808773 77.368ns 0.062ns 77.300ns 77.873ns 0.000 0.000 1194.2 231.6
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
this changeset includes two changes to silence the warnings reported by shellcheck. This changeset has no functional impact and serves as a proactive code improvement.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#21756
* github.com:scylladb/scylladb:
install-dependencies.sh: quote array to avoid re-splitting
install-dependencies.sh: define local variable using "local -A"
This "service" is a bag for code responsible for dictionary training,
created to unclutter main() from dictionary-specific logic.
It starts the RPC dictionary training loop when the relevant cluster feature is enabled,
pauses and unpauses it appropriately whenever relevant config or leadership
status are updated, and publishes new dictionaries whenever the training fiber produces them.
Adds glue which causes the contents of system.dicts to be sent
in group 0 snapshots, and causes a callback to be called when
system.dicts is updated locally. The callback is currently empty
and will be hooked up to the RPC compressor tracker in one of the
next commits.
Adds a new system table which will act as the medium for distributing
compression dictionaries over the cluster.
This table will be managed by Raft (group 0). It will be hooked up to it in
follow-up commits.
Adds glue needed to pass lz4 and zstd with streaming and/or dictionaries
as the network traffic compressors for Seastar's RPC servers.
The main jobs of this glue are:
1. Implementing the API expected by Seastar from RPC compressors.
2. Expose metrics about the effectiveness of the compression.
3. Allow dynamically switching algorithms and dictionaries on a running
connection, without any extra waits.
The biggest design decision here is that the choice of algorithm and dictionary
is negotiated by both sides of the connection, not dictated unilaterally by the
sender.
The negotiation algorithm is fairly complicated (a TLA+ model validating
it is included in the commit). Unilateral compression choice would be much simpler.
However, negotiation avoids re-sending the same dictionary over every
connection in the cluster after dictionary updates (with one-way communication,
it's the only reliable way to ensure that our receiver possesses the dictionary
we are about to start using), lets receivers ask for a cheaper compression mode
if they want, and lets them refuse to update a dictionary if they don't think
they have enough free memory for that.
In hindsight, those properties probably weren't worth the extra complexity and
extra development effort.
Zstd can be quite expensive, so this patch also includes a mechanism which
temporarily downgrades the compressor from zstd to lz4 if zstd has been
using too much CPU in a given slice of time. But it should be noted that
this can't be treated as a reliable "protection" from negative performance
effects of zstd, since a downgrade can happen on the sender side,
and receivers are at the mercy of senders.
We are planning to improve some usages of compression in Scylla
(in which we compress small blocks of data) by pre-training
compression dictionaries on similar data seen so far.
For example, many RPC messages have similar structure
(and likely similar data), so the similarity could be exploited
for better compression. This can be achieved e.g. by training
a dictionary on the RPC traffic, and compressing subsequent
RPC messages against that dictionary.
To work well, the training should be fed a representative sample
of the compressible data. Such a sample can be approached by
taking a random subset (of some given reasonable size) of the data,
with uniform probability.
For our purposes, we need an online algorithm for this -- one
which can select the random k-subset from a stream of arbitrary
size (e.g. all RPC traffic over an hour), while requiring only
the necessary minimum of memory.
This is a known problem, called "reservoir sampling".
This PR introduces `reservoir_sampler`, which implements
an optimal algorithm for reservoir sampling.
Additionally, it introduces `page_sampler` -- a wrapper for `reservoir_sampler`,
which uses it to select a random sample of pages from a stream of bytes.
Introduces a util which launches a new OS thread and accepts
callables for concurrent execution.
Meant to be created once at startup and used until shutdown,
for running nonpreemptible, 3rd party, non-interactive code.
Note: this new utility is almost identical to wasm::alien_thread_runner.
Maybe we should unify them.
Adds utilities for "advanced" methods of compression with lz4
and zstd -- with streaming (a history buffer persisted across messages)
and/or precomputed dictionaries.
This patch is mostly just glue needed to use the underlying
libraries with discontiguous input and output buffers, and for reusing the
same compressor context objects across messages. It doesn't contain
any innovations of its own.
There is one "design decision" in the patch. The block format of LZ4
doesn't contain the length of the compressed blocks. At decompression
time, that length must be delivered to the decompressor by a channel
separate to the compressed block itself. In `lz4_cstream`, we deal
with that by prepending a variable-length integer containing the
compressed size to each compressed block. This is suboptimal for
single-fragment messages, since the user of lz4_cstream is likely
going to remember the length of the whole message anyway,
which makes the length prepended to the block redundant.
But a loss of 1 byte is probably acceptable for most uses.
- create
- a cluster with given topology
- keyspace with tablets and given rf value
- table with some data
- backup
- flush all nodes
- kick backup API on every node
- re-create keyspace and table
- drop it first
- create again with the same parameters and schema, but don't
populate table with data
- restore
- collect nodes to contact and corresponding list of TOCs
according to the preferred "scope"
- ask selected nodes to restore, limiting its streaming scope
and providing the specific list of sstables
- check
- select mutation fragments from all nodes for random keys
- make sure that the number of non-empty responses equals the
expected rf value
Specific topologies, RFs and stream scopes used are:
rf = 1, nodes = 3, racks = 1, dcs = 1, scope = node
rf = 3, nodes = 5, racks = 1, dcs = 1, scope = node
rf = 1, nodes = 4, racks = 2, dcs = 1, scope = rack
rf = 3, nodes = 6, racks = 2, dcs = 1, scope = rack
rf = 3, nodes = 6, racks = 3, dcs = 1, scope = rack
rf = 2, nodes = 8, racks = 4, dcs = 2, scope = dc
nodes and racks are evenly distributed in racks and dcs respectively
in the last topo RF effectively becomes 4 (2 in each dc)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of those -- the POST /storage_service/keyspace that loads
and streams new sstables from /upload and POST /storage_service/restore
that does the same, but gets sstables from object store.
The new optional parameter allow users to tun the streaming phase
behavior. The test/pylib client part is also updated here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Semi-mechanical change that adds newly introduced "scope" parameter to
all the functions between API methods and the low-level streamer object.
No real functional changes. API methods set it to "all" to keep existing
behavior.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Loading and streaming tablets has pre-filtering loop that walks the
tablet map sorts sstables into three lists:
- fully contained in one of map ranges
- partially overlapping with the map
- not intersecting with the map
Sstables from the 3rd list is immediately dropped from the process and
for the remaining two core load-and-stream happens.
This filtering deserves more care from the newly introduced scope. When
a tablet replica set doesn't get in the scope, the whole entry can be
disregarded, because load-and-stream will only do its "load" part anyway
and all mutations from it will be ignored.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's been some discussions of how primary replica only streaming
schould interact with the scope. There are two options how to consider
this combination:
- find where the primary replica is and handle it if it's within the
requested sope
- within the requested scope find the primary replica for that subset of
nodes, then handle it
There's also some itermediate solution: suppoer "primary replica in DC"
and reject all other combinations.
Until decided which way is correct, let's disable this configuration.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently load-and-stream sends mutations to whatever node is considered
to be a "replica" for it. One exception is the "primary-replica-only"
flag that can be requested by the user.
This patch introduces a "scope" parameter that limits streaming part in
where it can stream the data to with 4 options:
- all -- current way of doing things, stream to wherever needed
- dc -- only stream to nodes that live in the same datacenter
- rack -- only stream to nodes that live in the same rack
- node -- only "stream" to current node
It's not yet configurable and streamer object initializes itself with
"all" mode. Will be changed later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Preparational patch. Next will add more code to get_endpoints() that
will need to work for both if/else branches, this change helps having
less churn later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extract can_sort_by_proximity() out so it can be used
later by storage_proxy, and introduce do_sort_by_proximity
that sorts unconditionally.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To reduce test executable size and speed up compilation time, compile unit
tests into a single executable.
Here is a file size comparison of the unit test executable:
- Before applying the patch
$ du -h --exclude='*.o' --exclude='*.o.d' build/release/test/boost/ build/debug/test/boost/
11G build/release/test/boost/
29G build/debug/test/boost/
- After applying the patch
du -h --exclude='*.o' --exclude='*.o.d' build/release/test/boost/ build/debug/test/boost/
5.5G build/release/test/boost/
19G build/debug/test/boost/
It reduces executable sizes 5.5GB on release, and 10GB on debug.
Closes#9155Closesscylladb/scylladb#21443
Currently, `get_network_topology_options()` is using gossip data
and iterates over topology using IPs and not host IDs, which may
result in operating on inconsistent data.
This method's implemenations has been changed to instead use
`get_datacenters()`, which should always return consistent data.
Fixes: scylladb/scylladb#21490Closesscylladb/scylladb#21940
Where the grammar supports IN, we add NOT IN. This includes the WHERE
clause and LWT IF clause.
Evaluation of NOT IN follows from IN.
In statement_restrictions analysis, they are different, as NOT IN
doesn't enable any clever query plan and must filter.
Some tests are added. An error message was changed ('in' changed to 'IN'),
so some tests are adjusted.
Closesscylladb/scylladb#21992
we already check `self.cmd` for null at the very beginning of the
`ScyllaServer.stop()`, and in the `try` block, we don't reset
`self.cmd`, hence there is no need to check it again.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21936
`prefixed()` is a static function in `mutation_partition_v2.cc`.
and this function is not used in this translation unit. so let's
remove it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22006
Currently, in tablet_map_to_mutation, repair's and migration's
tablet_task_info is always set.
Do not set the tablet_task_info if there is no running operation.
Closesscylladb/scylladb#22005
This patch adds WCU support for update_item. The way Alternator modifies
values means we don't always have the full item sizes. When there is a
read-before-write, the code in rmw_operation takes care of the object
size.
When updating a value without read-before-write, we will make a rough
estimation of the value's size. This is better than simply taking 1 (as
we do with delete) and is also more Alternator-like.
During the consolidation of per-suite pytest.ini files (commit 8bf62a086f),
the 'repair' marker was inadvertently dropped. This led to pytest warnings
for tests using the @pytest.mark.repair decorator.
This patch restores the marker declaration to eliminate the distracting
PytestUnknownMarkWarning:
```
test/topology_experimental_raft/test_tablets.py:396
/home/kefu/dev/scylladb/test/topology_experimental_raft/test_tablets.py:396: PytestUnknownMarkWarning: Unknown pytest.mark.repair - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
@pytest.mark.repair
```
Restoring the marker allows tests to use the 'repair' mark without
generating warnings.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21931
This series converts the call site using compare_endpoints with gms::inet_address.
With that both flavors of compare_endpoints and sort_by_proximity for inet_address
can be retired as no other uses remain.
Also, add a unit test for topology::sort_by_proximity before further changes
to it are considered.
* Code cleanup, no backport is needed
Closesscylladb/scylladb#21976
* github.com:scylladb/scylladb:
test: network_topology_strategy_test: add test_topology_sort_by_proximity
locator/topology: retire sort_by_proximity/compare_endpoints for inet_address
test: test_topology_compare_endpoints: use host_id:s
Currently node_ops_virtual_task shows stats of all system.topology_request
entries. However, the table also contains info about non-node_ops requests,
e.g. truncate.
Filter the entries used by node_ops_virtual_task by their type.
With this change bootstrap of the first node will not be visible.
Update the test accordingly.
Today our container is based on ubuntu:22.04, we need to build another container based on Ubuntu Pro for FIPS support (currently the latest one is 20.04)
The default docker build process doesn't change, if FIPS is required I have added `--type pro` to build a supported container.
To enable FIPS there is a need to attach an Ubuntu Pro subscription (it will be done as part of https://github.com/scylladb/scylla-pkg/issues/4186)
Closesscylladb/scylladb#21974
Similar to 9ace191616 (repair: Enable
small table optimization for RBNO bootstrap and decommission), this
patch enables small table optimization for RBNO rebuild.
This is useful for rebuild ops which is used for building an empty DC.
Fixes: #21951Closesscylladb/scylladb#21952
instead of reusing the variable name and overriding the parameter,
use a new name for the return value of `manager_internal()` for better
readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21932
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21837
When an sstable is unlinked, it remains in the _active list of the
sstable manager. Its memory might be reclaimed and later reloaded,
causing issues since the sstable is already unlinked. This patch updates
the on_unlink method to reclaim memory from the sstable upon unlinking,
remove it from memory tracking, and thereby prevent the issues described
above.
Added a testcase to verify the fix.
Fixes#21887
This is a bug fix in the bloom filter reload/reclaim mechanism and should be backported to older versions.
Closesscylladb/scylladb#21895
* github.com:scylladb/scylladb:
sstables_manager: reclaim memory from sstables on unlink
sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable()
sstables: introduce disable_component_memory_reload()
sstables_manager: log sstable name when reclaiming components
In commit bfee93c7, repair verbs were moved to IDL. During this refactoring,
the `gc_clock.hh` header became unused as its references were relocated.
`clang-include-cleaner` helped identify this unnecessary include, which is
now removed to clean up the codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21919
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21838
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::stable_partition`.
in this change, we:
- replace `boost::range::stable_parition()` to
`std::ranges::stable_parition()`
- since `std::ranges::stable_parition()` returns a subrange instead of
an iterator, change the names of variables which were previously used
for holding the return value of `boost::range::stable_partition()`
accordingly for better readability.
- remove unused `#include` of boost headers
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21911
Currently, when finishing db::view::calculate_affected_clustering_ranges
we deoverlap, transform and copy all ranges prepared before. This
is all done within a single continuation and can cause stalls.
We fix this by adding yields after each transform and moving elements
to the final vector one by one instead of copying them all at the end.
After this change, the longest continuation in this code will be
deoverlapping the initial ranges (and one transform). While it has
a relatively high computational complexity (we sort all ranges), it
should execute quickly because we're operating on views there and
we don't need to copy the actual bytes. If we encounter a stall there,
we'll need to implement an asynchronous `deoverlap` method.
Fixesscylladb/scylladb#21843Closesscylladb/scylladb#21846
The series contains small fixes to the gossiper one of which fixes#21930. Others I noticed while debugged the issue.
Fixes: scylladb/scylladb#21930Closesscylladb/scylladb#21956
* github.com:scylladb/scylladb:
gossiper: do not reset _just_removed_endpoints in non raft mode
gossiper: do not send echo message to yourself
gossiper: do not call apply for the node's old state
Fixes#20717
Enables abortable interface and propagates abort_source to all s3 objects used for reading the restore data.
Note: because restore is done on each shard, we have to maintain a per-shard abort source proxy for each, and do a background per-shard abort on abort call. This is synced at the end of "run()".
Abort source is added as an optional parameter to s3 storage and the s3 path in distributed loader.
There is no attempt to "clean up" an aborted restore. As we read on a mutation level from remote sstables, we should not cause incomplete sstables as such, even though we might end up of course with partial data restored.
Closesscylladb/scylladb#21567
* github.com:scylladb/scylladb:
test_backup: Add restore abort test case
sstables_loader: Make restore task abortable
distributed_loader: Add optional abort_source to get_sstables_from_object_store
s3_storage: Add optional abort_source to params/object
s3::client: Make "readable_file" abortable
Those are not used anymore now that the last call
site for compare_endpoints by inet_address is converted
to use host_id.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This is the last call site requiring the compare_endpoints
flavour for inet_address.
Once this test is converted to use host_id:s instead,
compare_endpoints and sort_by_proximity can be simplified
to support only host_id:s.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When running clang-include-cleaner, the tool performs static analysis by
"compiling" specified source files. Previously, non-existent included headers
caused the tool to skip source files, reducing the effectiveness of unused
include detection.
Problem:
- Header files like 'rust/wasmtime_bindings.hh' were not pre-generated
- Compilation errors led to skipping source file analysis
```
/__w/scylladb/scylladb/lang/wasm.hh:15:10: fatal error: 'rust/wasmtime_bindings.hh' file not found
15 | #include "rust/wasmtime_bindings.hh"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Skipping file /__w/scylladb/scylladb/lang/wasm.hh due to compiler errors. clang-include-cleaner expects to work on compilable source code.
1 error generated.
```
- This significantly reduced clang-include-cleaner's coverage
Solution:
- Build the `wasmtime_bindings` target to generate required header files
- Ensure all necessary headers are created before running static analysis
- Enable full source file checking for unused includes
By generating headers before analysis, we prevent skipping of source files
and improve the comprehensiveness of our include cleaner workflow.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21739
This test was added in PR #19789 but was disabled with xfail because of
the bug with way truncate saved the commit log replay positions. More
specifically, the replay positions for shards that had no mutations were
saved to system.truncated with shard_id == 0, regardless for which shard
it was actually saved for (see #21719).
The bug was fixed in #21722, so this change removes the xfail tag from
the test.
Closesscylladb/scylladb#21902
The data resolved has to apply all mutations from all replica to a
single mutation. In the extreme case, when all rows are dead, the
mutations can have around 10K rows in them. This is not a huge amount,
but it is enough to cause moderate stalls of <20ms.
To avoid this, use the gentle variant of apply(), which can yield in the
middle.
Fixes: scylladb/scylladb#21818Closesscylladb/scylladb#21884
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21836
"
This series converts repair, streaming and node_ops (and some parts of
alternator) to work on host ids instead of ips. This allows to remove
a lot of (but not all) functions that work on ips from effective
replication map.
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13830/
Refs: scylladb/scylladb#21777
"
* 'gleb/move-to-host-id-more' of github.com:scylladb/scylla-dev:
locator: topology: remove no longer use get_all_ips()
gossiper: change get_unreachable_nodes to host ids
locator: drop no longer used ip based functions from effective replication map and friends
test: move network_topology_strategy_test and token_metadata_test to use host id based APIs
replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges
replica/mutation_dump: use host ids instead of ips
alternator: move ttl to work with host ids instead of ips
storage_service: move node_ops code to use host ids instead of host ips
streaming: move streaming code to use host ids instead of host ips
repair: move repair code to use host ids instead of host ips
gossiper: add get_unreachable_host_ids() function
locator: topology: add more function that return host ids to effective replication map
locator: add more function that return host ids to effective replication map
This patch includes more tests (in Python) that I wrote while implementing the Alternator UpdateTable feature for adding a GSI to an existing table (https://github.com/scylladb/scylladb/issues/11567).
I explain each of these tests in the separate patches below, but basically they fall into two types:
1. Tests which pass with today's materialized views and Alternator GSI/LSI, and serve to ensure that whatever changes I do to the view update implementation, doesn't break corner cases that already worked.
2. Tests for the UpdateTable feature in Alternator which doesn't work today so xfail - and will need to work for #11567. We already had a few tests for this, but here I add more and improve coverage of various corner cases I discovered while implementing the featue.
I already have a working prototype for #11567 which passes all these tests. Many of these tests helped exposed various bugs in earlier versions of my code.
Closesscylladb/scylladb#21927
* github.com:scylladb/scylladb:
test/cqlpy: a few more functional tests for materialized views
test/alternator: more tests for UpdateTable create and delete GSI
test/alternator: make UpdateTable tests wait less
test/alternator: move UpdateTable tests to a separate file
test/alternator: add another test for elaborate GSI updates
test/alternator: test that DescribeTable returns IndexStatus for GSI
test/alternator: fix wrong test for UpdateTable metrics
test/alternator: add test for missing attribute in item in LSI
test/alternator: test that DescribeTable doesn't return IndexStatus for LSI
test/alternator: add tests for RBAC for create and delete GSI
When sending by ID we should check that we do not translate our old
address to our ID and sending locally. mark_alive should not be called
with node's old ip anyway.
This series attempts to get read of flakiness in `cache_algorithm_test` by solving two problems.
Problem 1:
The test needs to create some arbitrary partition keys of a given size. It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form: 0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...
Each of these keys is created several times and -- for the test to pass -- the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used. But if some background task happens to overwrite the allocator slot during a preemption, the keys used during "SELECT" will be different than the keys used during "INSERT", and the test will fail due to extra cache misses.
Problem 2:
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.
This can cause the test to fail spuriously.
With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.
This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)
Fixes#21536
Should be backported to prevent flaky failures in older branches.
Closesscylladb/scylladb#21948
* github.com:scylladb/scylladb:
cache_algorithm_test: harden against stats being confused by background activity
cache_algorithm_test: fix a use of an uninitialized variable
When an sstable is unlinked, it remains in the _active list of the
sstable manager. Its memory might be reclaimed and later reloaded,
causing issues since the sstable is already unlinked. This patch updates
the on_unlink method to reclaim memory from the sstable upon unlinking,
remove it from memory tracking, and thereby prevent the issues described
above.
Added a testcase to verify the fix.
Fixes#21887
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
When an sstable is unlinked or deactivated, it should be removed from
the component memory tracking metrics and any further reload/reclaim
should be disabled. This patch adds a new method that implements the
above mentioned functionality. This patch also updates the deactivate()
to use the new method. Next patch will use it to disable tracking when
an sstable is unlinked.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added a new method to disable reload of previously reclaimed components
from the sstable. This will be used to disable reload of bloom filters
after an sstable has been unlinked or deactivated.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
As part of #18750, we added a CQL statement CREATE ROLE WITH SALTED HASH
that prevented hashing a password when creating a role, effectively leading
to inserting a hash given by the user directly into the database. In #21350,
we noticed that Cassandra had implemented a CQL statement of similar semantics
but different syntax. We decided to rename Scylla's statement to be compatible
with Cassandra. Unfortunately, we didn't notice one more difference between
what we had in Scylla and what was part of Cassandra.
Scylla's statement was originally supposed to only be used when restoring
the schema and the user needn't have to be aware of its existence at all:
the database produced a sequence of CQL statements that the user saved to
a file and when a need to restore the schema arose, they would execute
the contents of the file. That's why that although we documented the feature,
it was only done in the necessary places. Those that weren't related to
the backup & restore procedure were deliberately skipped.
Cassandra, on the other hand, added the statement for a different purpose
(for details, see the relevant issue) and it was supposed to be used by
the user by design. The statement is also documented as such.
Since we want to preserve compatibility with Cassandra, we document
the statement and its semantics in the user documentation, explicitly
implying that it can be used by the user.
Fixesscylladb/scylladb#21691
We add a new test verifying that after creating a role with a hashed
password using one of the supported encryption algorithms: bcrypt,
sha256, sha512, or md5, the user can successfully log in.
If we're draining the last node in a DC, we won't have a chance to
evaluate candidates and notice that constraints cannot be satisfied (N
< RF). Draining will succeed and node will be removed with replicas
still present on that node. This will cause later draining in the same
DC to fail when we will have 2 replicas which need relocaiton for a
given tablet.
The expected behvior is for draining to fail, because we cannot keep
the RF in the DC. This is consistent, for example, with what happens
when removing a node in a 2-node cluster with RF=2.
Fixes#21826
When doing normal load balancing, we can ignore DOWN nodes in the node
set and just balance the UP nodes among themselves because it's ok to
equalize load just in that set, it improves the situation.
It's dangerous to do that when draining because that can lead to
overloading of the UP nodes. In the worst case, we can have only one
non-drained node in the UP set, which would receive all the tablets of
the drained node, doubling its load.
It's safer to let the drain fail or stall. This is decided by topology
coordinator, currently we will fail (on barrier) and rollback.
Explicitly disable tablets in a few tests that rely on features not yet supported with tablets.
Closesscylladb/scylladb#21070
* github.com:scylladb/scylladb:
test: disable tablets in test_raft_fix_broken_snapshot
test: disable tablets in test_raft_recovery_stuck
test: disable tablets in tet_raft_recovery_majority_lost
test: don't run test_raft_recovery_basic with tablets
test: fix test_writes_to_previous_cdc_generations work with tablets
test: fix topology_custom/test_mv_topology_change.py to work with tablets
test: correct replication factor in test_multidc.py
test: update test_view_build_status to work with tablets
test: fix test_change_rpc_address with tablets.
test: explicitly disable tablets in test_gropu0_schema_versioning
test: disable tablets in topology/test_mutation_schema_change.py
test: disable tablets in topology/test_mv.py
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.
This can cause the test to fail spuriously.
With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.
This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)
The test needs to create some arbitrary partition keys of a given size.
It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...
Each of these keys is created several times and -- for the test to pass --
the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used.
But if some background task happens to overwrite the allocator slot during a
preemption, the keys used during "SELECT" will be different than the keys used
during "INSERT", and the test will fail due to extra cache misses.
This patch adds a few more functional tests for the CQL materialized
view feature in the cqlpy. The new tests pass, but helped me catch bugs (and
understand what are *not* bugs) while refactoring some view update code.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have in test_gsi_updatetable.py several functional tests for
the Alternator feature of adding or deleting a GSI on an existing table,
through the UpdateTable operation.
This patch adds many more tests for various corner cases of this feature -
tests developed in parallel with actually implementing that feature.
All test in test_gsi_updatetable.py pass on Amazon DynamoDB but currently
xfail on Alternator, due to the following issues:
* #11567: Alternator: allow adding a GSI to a pre-existing table
* #9424: Alternator GSIs should exclude items with empty-string key components
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The UpdateTable tests for creating and deleting a GSI need to wait for
the asynchronous operation of the view's building and deletion, using
two utility functions wait_for_gsi() and wait_for_gsi_gone().
Because I originally wrote these tests for DynamoDB and its extremely
high latency for these operations, these functions waited a whole second
before checking for the end of the wait. This whole-second sleep is
absurd in Alternator where building a small view takes just a fraction of
a second. So let's lower the sleep time from 1 second to 0.1 seconds,
and allow these tests to pass much faster on Alternator (once this
feature is implemented in Alternator, of course - until then all these
tests still fail immediately on an unimplemented operation).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The source file test/alternator/test_gsi.py has already grown very
large, so this patch moves all the existing tests related to using
UpdateTable to add or delete a GSIs to a separate file:
test_gsi_updatetable.py.
We just move tests here - no new tests or functional changes to the
tests - but did use the opportunity for some small improvements in
the comments.
In the next patch we'll add more tests to this new file.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We have a test, test/alternator/test_gsi.py::test_update_gsi_pk which
created a GSI whose *partition key* was a regular column in the base
table, and exercised various elaborate updates requiring adding,
updating and deleting of rows from the materialized view.
In this patch, we add another similar test case, just for a *clustering
key*.
Both these tests are important regression tests - when we later
reimplement GSI we'll want to verify that none of the complex update
scenarios got broken (and indeed, some broken code did break these
tests).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a test reproducing issue #11471 - where DescribeTable
on a table that as an already built GSI (creating with the table itself)
must return IndexStatus == "ACTIVE".
This test passes on DynamoDB, but xfails on Alternator because of
issue #11471.
We actually had this check earlier, but it was part of a bigger xfailing
tests that checked multiple features. It's better to have it as a
separate test just for this feature, as we'll soon fix this issue and
make this test pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test we had for counting Alternator operations metrics ran the
UpdateTable request without any parameters, which isn't actually a
valid call - Amazon DynamoDB rejects such a call, saying one of the
different parameters must be present, and we'll want to do that
later too.
So let's fix the test to use a valid UpdateTable request, one that
does the silly BillingMode='PAY_PER_REQUEST'. This is already the
current setting, so nothing is really changed, but it's still counted
as an operation in the metric.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Test that when a table has an LSI, then if the indexed attribute is
missing, the item is added to the base table but not the index.
We already have exactly the same test for GSI in test_gsi.py, but forgot
to do write the same test for LSI. It's important to test this scenario
separately for GSIs and LSIs because in an upcoming GSI reimplementation
we plan to make the GSI and LSI implementation slightly different, and
they can have separate bugs (and in fact, we had such an LSI-specific
bug in one broken implementation).
We also have the same scenario that is tested here in the test
test_streams.py::test_streams_updateitem_old_image_lsi_missing_column
but that was a Alternator Streams test and we should have a more basic
test for this scenario in test_lsi.py.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Whereas GSIs have an IndexStatus when described by DescribeTable,
LSIs do not. The purpose of IndexStatus is to tell when the index is live,
and this is not needed for LSIs because they cannot be added to a base
table that already exists.
We already had a test for this, but it was hidden in an xfailing test
for many different DescribeTable attributes - so let's move it into it's
own, *passing*, test. The new tests passes on both Alternator and
Amazon DynamoDB.
This test is an important regression test for when we later add
IndexStatus support to GSI, and this test will ensure that we don't
accidentally introduce IndexStatus to LSIs as well - DynamoDB doesn't
generate it for LSIs so neither should Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In later patches we will implement (as requested in issue #11567) the
UpdateTable operation for creating a new GSI or removing a GSI on an
existing table. In this patch we add to test/alternator/test_cql_rbac.py
tests to exhaustively check that the new operations will behave as expected
in respect to role-based access control (RBAC):
1. UpdateTable requires the ALTER permissions on the affected table -
as was already the case before (and was documented in compatibility.md).
This should also be true for the newly-implemented UpdateTable
operations that create a GSI and delete a GSI, and we test that.
The above statement may sound counter-intuitive - why does creating
or deleting a GSI require ALTER permissions (on the base table), not
CREATE or DROP permissions? But this makes sense when you consider
that CREATE permissions should allow you create new independent tables,
not to change the behavior or performance of existing tables (which
adding a GSI does).
2. When a role has permissions to create a GSI, it should be able to
read the new GSI (SELECT permissions). This is known as "auto-grant".
3. When a GSI is deleted, whatever permissions was set on it is revoked,
so that if it's later recreated, the old permissions don't resurface.
This is known as "auto-revoke".
Because the UpdateTable feature for creating and deleting a GSI is not
yet enabled, the new tests are all marked "xfail".
The new tests, like all tests in the file test/alternator/test_cql_rbac.py
are Scylla-only and are skipped on Amazon DynamoDB - because they test
the Scylla-only CQL-based role-based access control API.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Empty plan with nodes to drain meant that we can exit tablet_draining
transition and move to the next stage of decommission/removenode.
In case tablet scheduler creates an empty plan for some reason but
there are still underained tablets, that could put topology in an
invalid state. For example, this can currently happen if there
are no non-draining nodes in a DC.
This patch adds a safety net in the topology coordinator which
prevents moving forward with undrained tablets.
In tablets mode, it is not allowed to CREATE a table
if replication factor can be satisfied. E.g. if the keyspace
is defined to have replication_factor = 3 and there
are only 2 replicas, in vnodes mode one still can
CREATE the table and write to it, whereas in tablets
mode one gets an error.
The confusion is what 'replication_factor' means.
When NetworkTopologyStrategy is used, in multi-dc mode, each DC must
have at least 'replication_factor' replicas and stores
'replication_factor' copies of data.
The test author (as well as the author of this "fix", see
my confused report of gh-21166) assumed that 'replication_factor'
means the total number of replicas, not the number of replicas
per DC.
Correct the test to use only one replica per DC, as this is the
topology the test is working with. The test is not specific
to the number of replicas, so the change does not impact
the logic of the test.
With tablets, it's not allowed to create a table in a keyspace
which replication factor exceeds the actual number of nodes in the
cluster.
Pass the replication factor to random_tables fixture so that
a keyspace with a correct replication_factor is created.
The test file contains two test cases, which both test
materialized view tombstone gc settings. With tablets the default
is "repair" which is different from vnodes.
The tests are testing that the gc settings are not inherited. With
tablets, the gc settings are forced. This is indistinguishable from
inheriting, so the tests are failing when run with tablets.
this issue was identified by clang-20.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#21835
* github.com:scylladb/scylladb:
locator: remove unused member variable
auth: remove unused member variable
db: remove unused member variable
When investigating issue #21724, the docstring for
`test_recover_stuck_raft_recovery` was found to be difficult to follow.
Restructured the docstring into an ordered list to:
1. Improve readability
2. Clearly outline the test steps
3. Make the test's logic and flow more immediately comprehensible
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21728
The /column_family/compaction_strategy has GET and POST implemented, the
latter changes the strategy on the table.
Unknown strategy name implicitly renders internal server error code by
catching exception from compaction_strategy::type() that tries to
convert strategy name string to strategy enum class type.
This is to finish validation of #21533
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21569
This pull request is continuation of scylladb/scylladb#20688 - contents of the main commit are the same, the only change is the additional commit with a test.
Until this patch, the materialized view flow-control algorithm (https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/) used a constant delay_limit_us hard-coded to one second, which means that when the size of view-update backlog reached the maximum (10% of memory), we delay every request by an additional second - while smaller amounts of backlog will result in smaller delays.
This hard-coded one maximum second delay was considered *huge* - it will slow down a client with concurrency 1000 to just 1000 requests per second - but we already saw some workloads where it was not enough - such as a test workload running very slow reads at high concurrency on a slow machine, where a latency of over one second was expected for each read, so adding a one second latecy for writes wasn't having any noticable affect on slowing down the client.
So this patch replaces the hard-coded default with a live-updateable configuration parameter, `view_flow_control_delay_limit_in_ms`, which defaults to 1000ms as before.
Another useful way in which the new `view_flow_control_delay_limit_in_ms` can be used is to set it to 0. In that case, the view-update flow control always adds zero delay, and in effect - does absolutely nothing. This setting can be used in emergency situations where it is suspected that the MV flow control is not behaving properly, and the user wants to disable it.
The new parameter's help string mentions both these use cases of the parameter.
Fixes#18187
This is new functionality, no need to backport to any open source release.
Closesscylladb/scylladb#21647
* github.com:scylladb/scylladb:
materialized views: test for the MV delay configuration parameter
service: add injection for skipping view update backlog
materialized view: make flow-control maximum delay configurable
When we open a PR with conflicts, the PR owner gets a notification about the assignment but has no idea if this PR is with conflicts or not (in Scylla it's important since CI will not start on draft PR)
Let's add a comment to notify the user we have conflicts
Closesscylladb/scylladb#21939
There is an assumption that every destroyed compaction_group will be stopped first.
Otherwise, the group is still referenced by compaction manager and can use it after
freed. That's what happened in issue #21867 in the context of merge.
The issue is pre-existing but was made more likely with merge.
One problem is a race between split and cleanup, where if split is emitted while
cleanup is stopping groups, it can happen split preparation adds new groups that will
never be closed, since cleanup is already past the group stopping step.
Another problem found is that split completion handler is not accounting for possible
existence of merging groups, if split happens right after merge. Split completion
handler should stop all empty groups that previously had data split from them.
The problems will be fixed by guaranteeing that new groups will not be added for a
tablet being migrated away, and that empty groups are properly closed when handling
split completion.
A reproducer was added.
Fixes#21867.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#21920
Standardize on one range library to reduce dependency load.
Unfortunately, std::views::concat (the replacement for boost::join),
is C++26 only. We use two separate inserts to the result vector to
compensate, and rationalize it by saying that boost::join() is likely
slow due to the need for type-erasure.
Closesscylladb/scylladb#21834
remove unnecessary _mark_dirty call
server_broken_event - stop the whole file execution
(prevent the next tests from running because
Pyhon server object is broken PR: scylladb/scylladb#18236).
and next file execution will create its new cluster
so _mark_dirty will not change anything
Closesscylladb/scylladb#21429
In case the update is rolled back on error, call clear_gently
for table_erms and view_erms to prevent potential stalls
with a large number of tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After last patch, we deliberately don't yield between
update of base table erm and updating its view,
which was the scenario tested with the `delay_after_erm_update`
error injection point.
Instead, call maybe_yield in between base/views updates
to prevent reactor stalls with many tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the loop updating all tables (including views) with the
new effective_replication_map may yield, and therefore expose
a state where the base and view tables effective_replication_map
and topology are out of sync (as seen in scylladb/scylladb#17786)
To prevent that, loop over all base tables and for each table
update the base table and all views atomically, without yielding,
and so allow yielding only between base tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Commit f2ff701489 introduced
a yield in update_effective_replication_map that might
cause the storage_group manager to be inconsistent with the
new effective_replication_map (e.g. if yielding right
before calling `handle_tablet_split_completion`.
Also, yielding inside storage_service::replicate_to_all_cores
update loop means that base tables and their views
aren't updated atomically, that caused scylladb/scylladb#17786
This change essentially reverts f2ff701489
and makes handle_tablet_split_completion synchronous too.
The stopped compaction groups future is kept as a memebr and
storage_group_manager::stop() consumes this future during table::stop().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before this change, we didn't quote the array of the keys of an
associative array. and shellcheck warns like:
```
In install-dependencies.sh line 330:
for package in ${!pip_packages[@]}
^-----------------^ SC2068 (error): Double quote array expansions to avoid re-splitting elements.
```
While the current keys in the associative array do not contain spaces,
quoting array expansions is a recommended defensive programming practice.
This change:
- Prevents potential future issues with unexpected whitespace
- Silences Shellcheck warning without changing functionality
- Improves code quality and maintainability
Specifically modified the array iteration from:
`for package in ${!pip_packages[@]}`
to:
`for package in "${!pip_packages[@]}"`
This change has no functional impact and serves as a proactive code improvement.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
"declare -A local GO_ARCH" does not define a single variable, instead
it defines two variables named "local" and "GO_ARCH". shellcheck
warns when analyzing this script:
```
In ./install-dependencies.sh line 188:
declare -A local GO_ARCH=(
^---^ SC2316 (error): This applies declare to the variable named local, which is probably not what you want. Use a separate command or the appropriate `declare` optionsinstead.
^---^ SC2034 (warning): local appears unused. Verify use (or export if used externally).
```
and per the output of "help declare":
```
declare: declare [-aAfFgiIlnrtux] [name[=value] ...] or declare -p [-aAfFilnrtux] [name ...]
```
we defined two associative arrays instead of one.
In this change, we use the correct Bash syntax `local -A GO_ARCH` to:
- Create a single, locally-scoped associative array
- Eliminate static analysis warnings
- Improve code readability and maintainability
This is a cleanup change with no production impact.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
They exist in the original documentation, but are not yet implemented.
Now it's possible to do it.
It slightly more complex that its compaction counterpart in a sense than
get method reports megabits/s by default and has an option to convert to
MiBs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are now four of those and these are all the same in the way they
interpret the value parameter (though it's named differently)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In this change, tablet_virtual_task starts supporting tablet
migration, in addition to tablet repair. Both tablet operations
reuse the same virtual_task because their task data is retrieved
similarly. However, it changes nothing from the task manager
API users' perspective. They can list running migrations or check
their statuses all the same as if migration had its own virtual_task.
Users can see running migration tasks - finished tasks are not
presented with the task manager API. However, the result
of the migration (whether it succeeded or failed) would be
presented to users, if they use wait API.
If a migration was reverted, it will appear to users as failed.
We assume that the migration was reverted, when its destination
does not contain a tablet replica.
Fixes: https://github.com/scylladb/scylladb/issues/21365.
No backport, new feature
Closesscylladb/scylladb#21729
* github.com:scylladb/scylladb:
test: boost: check migration_task_info in tablet_test.cc
replica: add repair related fields to tablet_map_to_mutation
test: add tests to check the failed migration virtual tasks
test: add tests to check the list of migration virtual tasks
test: add tests to check migration virtual tasks status
test: topology_tasks: generalize repair task functions
service: extend tablet_virtual_task::abort
service: extend tablet_virtual_task::wait
service: extend tablet_virtual_task::get_status_helper
service: extend tablet_virtual_task::contains
service: extend tablet_virtual_task::get_stats
service: tasks: make get_table_id a method of virtual_task_hint
service: tasks: extend virtual_task_hint
replica: service: add migration_task_info column to system.tablets
locator: extend tablet_task_info to cover migration tasks
locator: rename tablet_task_info methods
Both values are in fact db::config named values. They are observed by,
respectively, compaction manager and stream manager: when changed, the
observer kicks corresponding sched group's update_io_bandwidth() method.
Despite being referenced by managers, there's no way to update those
values anyhow other than updating config's named values themselves.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some endpoints in config block will need to actually _update_ values on
config (see next patches why), and const reference stands on the way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to get stream throughput, the API will need stream_manager.
In order to set stream throughput, the API will need db::config to
update the corresponding named value on it.
Said that, move the endpoints to relevant blocks.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to update compaction throughput API would need to update the
db::config value, so the endpoint in question should sit in the block
that has db::config at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The operator() of named_value() prints the allowed values on error which
can be a vector, so the ranges formatting should be there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In Scylla, we can have either `closed` or `merged` PRs. Based on that we decide when to start the backport process when the label was added after the PR is closed (or merged),
In https://github.com/scylladb/scylladb/pull/21876 even when adding the proper backport label didn't trigger the backport automation. Https://github.com/scylladb/scylladb/pull/21809/ caused this, we should have left the `state=closed` (this includes both closed and merged PR)
Fixing it
Closesscylladb/scylladb#21906
* seastar 72c7ac575...3133ecdd6 (12):
> util/backtrace: Optimize formatter to reduce memory allocation overhead
> scheduler: Report long queue stall
> log: drop specialization of boost::lexical_cast for log_level
> stall-detector: Remove unused _stall_detector_reports_per_minute
> Merge 'when_all: add Sentinel support to when_all_succeed() ' from Kefu Chai
> scripts/perftune.py: Implement AWS IMDSv2 call
> net/tls: Add a way to disable certificate validation
> tests: Improve websocket parser tests
> scripts/stall-analyser: improve error messages on invalid input
> reserve-memory: document that seastar just doesnt use the reserves
> Merge 'Minor metrics memory optimizations' from Stephan Dollberg
> json_formatter: Add support for standard range containers
Closesscylladb/scylladb#21869
Store the endpoint ip address together with the client (note it may be
different from the address the client is connected to in case
preferable address is different). This allows up to drop lookup in the
address map which may eventually fail if an endpoint was already
deleted.
Fixes: scylladb/scylladb#21840
Message-ID: <Z1mpMMe-o0ggBU_F@scylladb.com>
"
The series moves node ops, repair and streaming verbs to IDL. Also
contains IDL related cleanups.
In addition to the CI tested manually by bootstrapping a node with the
series into a cluster of old nodes with repair and streaming both in
gossiper and raft mode. This exercises repair, streaming and node_ops
paths.
"
* 'gleb/move-more-rpcs-to-idl-v3' of github.com:scylladb/scylla-dev:
repair: repair_flush_hints_batchlog_request::target_nodes is not used any more, so mark it as such
streaming: move streaming verbs to IDL
messaging_service: move repair verbs to IDL
node_ops: move node_ops_cmd to IDL
idl: rename partition_checksum.dist.hh to repair.dist.hh
idl: move node_ops related stuff from the repair related IDL
This change goes thru locator:topology to use node&
instead of node* where nullptr is not possible. There are
places where the node object is used in unordered_set, in
those cases the node is wrapped in std::reference_wrapper.
Fixesscylladb/scylladb#20357Closesscylladb/scylladb#21863
Reads which need sstable index were computing
column_values_fixed_lengths each time. This showed up in perf profile
for a sstable-read heavy workload, and amounted to about 1-2% of time.
Computing it involves type name parsing.
Avoid by using cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the least-recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented with
arming with sstable's schema.
Also, fixes a potential use-after-free on schema in column_translation.
Closesscylladb/scylladb#21347
* github.com:scylladb/scylladb:
sstables: Fix potential use-after-free on column_translation::column_info::name
sstables: Avoid computing column_values_fixed_lengths on each read
Replace manual subrange advancement with the more concise and readable
`subrange.advance()` method. This change:
- Eliminates unnecessary subrange instance creation
- Improves code readability
- Reduces potential for unnecessary object allocation
- Leverages the built-in `advance()` method for cleaner iterator handling
The modification simplifies the iteration logic while maintaining the
same functional behavior.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21865
Extend tablet_virtual_task::wait to support migration tasks.
To decide what is a state of a finished migration virtual task
(done or failed), the tablet replicas are checked. The task state
is set to done, if the replicas contain the destination of a tablet
migration.
Extend tablet_virtual_task::get_status_helper to cover migration
tasks. get_status_helper is used by get_status and wait methods.
Waiting for a task in the latter will be modified in the following
patch.
Extend tablet_virtual_task::contains to check migration operations.
Returned virtual_task_hint contains also tablet_id (only for migration
tasks) and task_type.
Return immediately from methods that do not support migration
for non-repair task types. The methods' support for migration
will be implemented in the following patches.
This commit removes the information about the recommended way of upgrading
ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade
procedure is not supported (it was implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733Closesscylladb/scylladb#21876
Replace `dht/sharder.hh` with a "smaller" header, which provides
just the enough dependencies.
in f744007e, we traded `database.hh` with a smaller set of headers.
but it turns out `dht/sharder.hh` can be replaced with a even smaller
one. because `dht::sharder` is defined by `dht/token-sharding.hh`, and
what we need from `dht/sharder.hh` is this class's declaration.
`clang-include-cleaner` identified this issue.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21881
Add migration_task_info column to system.tablets. Set migration_task_info
value on migration request if the feature is enabled in the cluster.
Reflect the column content in tablet_metadata.
Update the service level cache in the node startup sequence, after the
service level and auth service are initialized.
The cache update depends on the service level data accessor being set
and the auth service being initialized. Before the commit, it may happen that a
cache update is not triggered after the initialization. The commit adds
an explicit call to update the cache where it is guaranteed to be ready.
Fixesscylladb/scylladb#21763Closesscylladb/scylladb#21773
Test cases related to #21826:
1. test_remove_failure_with_no_normal_token_owners_in_dc: attempts to
remove a node with another node down in the datacenter, leaving
no normal token owners in that dc (reproducing #21826).
Removenode is expected to fail in this case since it
should have no place to rebuild the removed node replicas,
yet it currently succeeds unexpectedly.
2. test_remove_failure_then_replace: verify that removenode
fails as expected when there are not enough nodes to
rebuild its replicas on, with and without additional zero-token nodes.
3. test_replace_with_no_normal_token_owners_in_dc: verify that
nodes can be replaced in a datacenter that has no live
token owners, with and without additional zero-token nodes.
Tablet replace uses all replicas to rebuild the lost replicas
and therefore should succeed in the edge case.
The restored data is verified as well.
Refs #21826
* New tests, no backport needed
Closesscylladb/scylladb#21827
* github.com:scylladb/scylladb:
topology_custom/test_tablets: add remove/replace tests for edge cases
test: pylib: _cluster_remove_node: log message on successful paths
test: pylib: _cluster_remove_node: mark server as removed only when removenode succeeded
Test cases related to #21826:
1. test_remove_failure_with_no_normal_token_owners_in_dc: attempts to
remove a node with another node down in the datacenter, leaving
no normal token owners in that dc (reproducing #21826).
Removenode is expected to fail in this case since it
should have no place to rebuild the removed node replicas,
yet it currently succeeds unexpectedly.
2. test_remove_failure_then_replace: verify that removenode
fails as expected when there are not enough nodes to
rebuild its replicas on, with and without additional zero-token nodes.
3. test_replace_with_no_normal_token_owners_in_dc: verify that
nodes can be replaced in a datacenter that has no live
token owners, with and without additional zero-token nodes.
Tablet replace uses all replicas to rebuild the lost replicas
and therefore should succeed in the edge case.
The restored data is verified as well.
Refs #21826
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently truncating a table works by issuing an RPC to all the nodes which call `database::truncate_table_on_all_shards()`, which makes sure that older writes are dropped.
It works with tablets, but is not safe. A concurrent replication process may bring back old data.
This change makes makes TRUNCATE TABLE a topology operation, so that it excludes with other processes in the system which could interfere with it. More specifically, it makes TRUNCATE a global topology request.
Backporting is not needed.
Fixes#16411Closesscylladb/scylladb#19789
* github.com:scylladb/scylladb:
docs: docs: topology-over-raft: Document truncate_table request
storage_proxy: fix indentation and remove empty catch/rethrow
test: add tests for truncate with tablets
storage_proxy: use new TRUNCATE for tablets
truncate: make TRUNCATE a global topology operation
storage_service: move logic of wait_for_topology_request_completion()
RPC: add truncate_with_tablets RPC with frozen_topology_guard
feature_service: added cluster feature for system.topology schema change
system.topology_requests: change schema
storage_proxy: propagate group0 client and TSM dependency
Everywhere replication strategy returns zero host id in replica set instead
of the real one if no tokens are configured yet in token metadata. It
worked because code that translates ids to ips knows that zero host id
is a special one, so putting zero there was equivalent to allow local
access. But now we use host ids directly so we need to return real host
id here to allow local access before token metadata is populated.
Message-ID: <Z1hBHsEo4wYzzgvJ@scylladb.com>
New logs allow us to easily distinguish two cases in which
waiting for apply times out:
- the node didn't receive the entry it was waiting for,
- the node received the entry but didn't apply it in time.
Distinguishing these cases simplifies reasoning about failures.
The first case indicates that something went wrong on the leader.
The second case indicates that something went wrong on the node
on which waiting for apply timed out.
As it turns out, many different bugs result in the `read_barrier`
(which calls `wait_for_apply`) timeout. This change should help
us in debugging bugs like these.
We want to backport this change to all supported branches so that
it helps us in all tests.
Closesscylladb/scylladb#21855
When tablet scheduler drains nodes, it chooses target location based
on "badness" metric. Nodes with lowest score are preferred. Before the
patch, the score which was used was the number of tablets on that node
post-movement. This way we populate least-loaded node first. But this
works only if nodes have equal number of shards. If nodes have different
capacity, then number of tablets is not a good metric, because we don't
aim to equalize per-node count, but per-shard count. We assume that each
shard has equal capacity.
Because of this bug, during decommission, the nodes with fewer shards
would be preferred to receive replicas, which may lead to overloading
of those nodes. This imbalance would be later fixed by the normal load
balancing logic, but it's still problematic.
Fixes#21783Closesscylladb/scylladb#21860
Log a message when removenode succeeded as expected
or when it failed as expected with the `expected_error`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, we call server_mark_removed also when removenode
failed with the `expected_error`, where the function returns success
but the server is not supposed to be in a removed state.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
TRUNCATE TABLE saves the current commit log replay positions in case there is a crash so that replay knows where to begin replaying the mutations. These are collected and saved per shard into `system.truncated`. In case a shard received no mutations, its replay position will be an empty, default constructed object of type `db::replay_position` with its members set to 0. Truncate will incorrectly interpret these empty replay positions as if they were coming from shard 0, and save them as such, potentially overwriting an actual valid replay position coming from the actual shard 0. In the case of a crash, this will cause the commit log on shard 0 to be replayed from the beginning, and result with data resurrection.
Fixes#21719Closesscylladb/scylladb#21722
* github.com:scylladb/scylladb:
test: add test for truncate saving replay positions
database: correctly save replay position for truncate
On startup, if a server reads an sstable that belongs to a tablet that
doesn't have any local replica, it throws an error in the following
format and refuses to start :
```
Storage wasn't found for tablet 1 of table test.test
```
This patch updates the code path to throw a nicer error that includes
the sstable name that caused the problem.
This patch also adds a testcase to verify the error being thrown.
Fixes https://github.com/scylladb/scylladb/issues/18038
PR improves an error message - no need to backport.
Closesscylladb/scylladb#21805
* github.com:scylladb/scylladb:
replica/table: fix indent in compaction_group_for_sstable
replica/table: improve error message when encountering orphaned sstables
Replace boost::make_iterator_range() with std::ranges::subrange.
This change improves code modernization and reduces external dependencies:
- Replace boost::make_iterator_range() with std::ranges::subrange
- Remove boost/range/iterator_range.hpp include
- Improve iterator type detection in interval.hh using std::ranges::const_iterator_t<Range>
This is part of ongoing efforts to modernize our codebase and minimize
external dependencies.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21787
One of run_when_memory_available() checks mirrors the one done by the
execution_permitted() helper, so its worth re-using it. Since the former
helper is header template, the latter is worth moving to header too.
And, once re-used, the `bool blocking` variable becomes excessive, and
the `if (blocking)` check can also be expressed with fewer LOCs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21812
since fedora 38 is EOL. and fedora 39 comes with fmt v10.0.0, also,
we've switched to the build image based on fedora 40, which ships
fmt-devel v10.2.1, there is no need to support fmt < 10.
in this change, we drop the support fmt < 10.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21847
Artifact tests have been failing since the switch to the native nodetool, because ScyllaDB doesn't leave any IOCBs for tools. On some setups it will consume all of them and then nodetool and any other native app will refuse to start because it will fail to allocate IOCBs.
This PR fixes this by making use of the freshly introduced `--reserve-io-control-blocks` seastar option, to reserve IOCBs for tool applications. Since the `linux-aio` and `epoll` reactor backends require quite a bit of these, we enable the `io_uring` reactor backend and switch tools to use this backend instead. The `io_uring` reactor backend needs just 2 IOCBs to function, so the reserve of 10 IOCBs set up in this PR is good for running 5 tool applications in parallel, which should be more than enough.
Fixes: https://github.com/scylladb/scylladb/issues/19185
The problem this PR fixes has a manual workaround (and is rare to begin with), no backport needed.
Closesscylladb/scylladb#21527
* github.com:scylladb/scylladb:
main: configure a reserve IOCB for scylla-nodetool and friends
configure: enable the io_uring backend
main: use configure seastar defaults via app_template::seastar_options
Use raw string literals to prevent syntax warnings when using regular
expressions with backslash-based patterns.
The original code triggered a SyntaxWarning in developer mode (`python3 -Xdev`)
due to unescaped backslash characters in regex patterns like '\s'. While
CPython typically interprets these silently, strict Python parsing modes
raise warnings about potentially unintended escape sequences.
This change adds the `r` prefix to string literals containing regex patterns,
ensuring consistent behavior across different Python runtime configurations
and eliminating unnecessary syntax warning like:
```
/opt/scylladb/scripts/libexec/scylla_io_setup:41: SyntaxWarning: invalid escape sequence '\s'
pattern = re.compile(_nocomment + r"CPUSET=\s*\"" + _reopt(_cpuset) + _reopt(_smp) + "\s*\"")
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21839
In the cleanup commit a840949ea0
a regression was introduced that caused backward incompatible changes
in the gossiper application state name strings.
In the e486e0f759 the value
`application_state::CDC_STREAMS_TIMESTAMP` was changed to
`application_state::CDC_GENERATION_ID`, but the name string
"CDC_STREAMS_TIMESTAMP" was kept for backward compatibility.
The cleanup commit a840949ea0 however
changed the name string to "CDC_GENERATION_ID" by ommission (not noticing
the difference) which caused backward incompatible change.
There is also another case found of "IGNOR_MSB_BITS" (that has a typo -
missing the "E" in "IGNORE") to "IGNORE_MSB_BITS", which also needs to
be reverted back to keep the backward compatibility.
Fixes: scylladb/scylladb#21811Closesscylladb/scylladb#21813
This patch adds the unit tests for truncate with tablets.
test_truncate_while_migration() triggers a tablet migration, then runs
a TRUNCATE TABLE for the table containing the tablet being migrated.
test_truncate_with_concurrent_drop() starts a truncate, then attempts to
drop the table while it is being truncated.
test_truncate_while_node_restart() validates the case where a replica
node is restarted while truncate is running.
test_truncate_with_coordinator_crash() validates if truncate is
correctly completed in cases where the topology coordinator has crashed
or restarted after the truncate session is cleared, but before the
truncate request is finalized.
This commit adds the code needed to create a TRUNCATE global topology
request. It also adds the handler for this request to the topology
coordinator.
The execution of the truncate operation is not canceled on a timeout,
but the query coordinator side will return a timeout error.
column_translation::state is storing pointers to column names, which
are stable only as long as schema_ptr is alive. sstable object caches
last used column_translation, and reuses column_translation::state if
the schema version matches. But this doesn't guarantee that the schema
object was not destroyed and recreated in between. This can happen if
the schema version expired in registry and then was pulled again from a
different node via get_schema_for_read().
Spotted by reading the code.
Fix by storing schema_ptr in column_translation. This can pin old
schema in memory until a newer schema is used to read the sstable, or
until sstable is compacted away. I think this shouldn't be a problem
in practice.
Reads which need clustering index cursor were computing
column_values_fixed_lengths each time. This showed up in perf profile
for a sstable-read heavy workload, and amounted to about 1%.
Avoid by using cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the most recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented with
arming with sstable's schema.
The function get_service_levels is used to retrieve all service levels
and it is called from multiple different contexts.
Importantly, it is called internally from the context of group0 state reload,
where it should be executed with a long timeout, similarly to other
internal queries, because a failure of this function affects the entire
group0 client, and a longer timeout can be tolerated.
The function is also called in the context of the user command LIST
SERVICE LEVELS, and perhaps other contexts, where a shorter timeout is
preferred.
The commit introduces a function parameter to indicate whether the
context is internal or not. For internal context, a long timeout is
chosen for the query. Otherwise, the timeout is shorter, the same as
before. When the distinction is not important, a default value is
chosen which maintains the same behavior.
The main purpose is to fix the case where the timeout is too short and causes
a failure that propagates and fails the group0 client.
Fixesscylladb/scylladb#20483Closesscylladb/scylladb#21748
Replace boost::join() with std::ranges::join_view() as an interim solution
before C++26's std::views::concat becomes available. This change:
- Reduces dependencies on the Boost Ranges library
- Moves closer to standard library implementations
- Improves code maintainability and future compatibility
This is part of ongoing efforts to modernize our codebase and minimize
external dependencies.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21786
Clean up the unnecessary includes reported by the GitHub checks that are
polluting the PR diffs.
The "utils/assert.hh" report should be actually fixed by the #21739, but
as the usage of `SEASTAR_ASSERT()` is protected by the `SEASTAR_DEBUG`
check it makes sense to include the header conditionally as well.
Closesscylladb/scylladb#21817
database.hh is a heavyweight include file with a lot of fan-in.
auto_refreshing_sharder.hh has a lot of fan out. The combination
means a large dependency load.
Deinline the class and use forward declarations to avoid the #include.
There is no expected performance impact because all the functions are
virtual.
Ref #1
Note: this shouldn't belong in dht, but be injected by a higher layer,
but this isn't addressed by the patch.
Closesscylladb/scylladb#21768
The goal of merge is to reduce the tablet count for a shrinking table. Similar to how split increases the count while the table is growing. The load balancer decision to merge is implemented today (came with infrastructure introduced for split), but it wasn't handled until now.
Initial tablet count is respected while the table is in "growing mode". For example, the table leaves it if there was a need to split above the initial tablet count. After the table leaves the mode, the average size can be trusted to determine that the table is shrinking. Merge decision is emitted if the average tablet size is 50% of the target. Hysteresis is applied to avoid oscillations between split and merges.
Similar to split, the decision to merge is recorded in tablet map's resize_type field with the string "merge". This is important in case of coordinator failover, so new coordinator continues from where the old left off.
Unlike split, the preparation phase during merge is not done by the replica (with split compactions), but rather by the coordinator by co-locating sibling tablets in the same node's shard. We can define sibling tablets as tablets that have contiguous range and will become one after merge. The concept is based on the power-of-two constraint and token contiguity. For example, in a table with 4 tablets, tablets of ids 0 and 1 are siblings, 2 and 3 are also siblings.
The algorithm for co-locating sibling tablets is very simple. The balancer is responsible for it, and it will emit migrations so that "odd" tablet will follow the "even" one. For example, tablet 1 will be migrated to where tablet 0 lives. Co-location is low in priority, it's not the end of the world to delay merge, but it's not ideal to delay e.g. decommission or even regular load balancing as that can translate into temporary unbalancing, impacting the user activities. So co-location migrations will happen when there is no more important work to do.
While regular balancing is higher in priority, it will not undo the co-location work done so far. It does that by treating co-located tablets as if they were already merged. The load inversion convergence check was adjusted so balancer understand when two tablets are being migrated instead of one, to avoid oscillations.
When balancer completes co-location work for a table undergoing merge, it will put the id of the table into the resize_plan, which is about communicating with the topology coordinator that a table is ready for it. With all sibling tablets co-located, the coordinator can resize the tablet map (reduce it by a factor of 2) and record the new map into group0. All the replicas will react to it (on token metadata update) by merging the storage (memtable(s) + sstables) of sibling tablets into one.
Fixes#18181.
system test details:
test: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/tablets_split_merge_test.py
yaml file: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/test-cases/features/tablets/tablets-split-merge-test.yaml
instance type: i3.8xlarge
nodes: 3
target tablet size: 0.5G (scaled down by 10, to make it easier to trigger splits and merges)
description: multiple cycles of growing and shrinking the data set in order to trigger splits and merges.
data_set_size: ~100G
initial_tablets: 64, so it grew to 128 tablets on split, and back to 64 on merge.
latency of reads and writes that happened in parallel to split and merge:
```
$ for i in scylla-bench*; do cat $i | grep "Mode\|99th:\|99\.9th:"; done
Mode: write
99.9th: 3.145727ms
99th: 1.998847ms
99.9th: 3.145727ms
99th: 2.031615ms
Mode: read
99.9th: 3.145727ms
99th: 2.031615ms
99.9th: 3.145727ms
99th: 2.031615ms
Mode: write
99.9th: 3.047423ms
99th: 1.933311ms
99.9th: 3.047423ms
99th: 1.933311ms
Mode: read
99.9th: 3.145727ms
99th: 1.900543ms
99.9th: 3.145727ms
99th: 1.900543ms
Mode: write
99.9th: 5.079039ms
99th: 3.604479ms
99.9th: 35.389439ms
99th: 25.624575ms
Mode: write
99.9th: 3.047423ms
99th: 1.998847ms
99.9th: 3.047423ms
99th: 1.998847ms
Mode: read
99.9th: 3.080191ms
99th: 2.031615ms
99.9th: 3.112959ms
99th: 2.031615ms
```
Closesscylladb/scylladb#20572
* github.com:scylladb/scylladb:
docs: Document tablet merging
tests/boost: Add test to verify correctness of balancer decisions during merge
tests/topology_experimental_raft: Add tablet merge test
service: Handle exception when retrying split
service: Co-locate sibling tablets for a table undergoing merge
gms: Add cluster feature for tablet merge
service: Make merge of resize plan commutative
replica: Implement merging of compaction groups on merge completion
replica: Handle tablet merge completion
service: Implement tablet map resize for merge
locator: Introduce merge_tablet_info()
service: Rename topology::transition_state::tablet_split_finalization
service: Respect initial_tablet_count if table is in growing mode
service: Wire migration_tablet_set into the load balancer
locator: Add tablet_map::sibling_tablets()
service: Introduce sorted_replicas_for_tablet_load()
locator/tablets: Extend tablet_replica equality comparator to three-way
service: Introduce alias to per-table candidate map type
service: Add replication constraint check variant for migration_tablet_set
service: Add convergence check variant for migration_tablet_set
service: Add migration helpers for migration_tablet_set
service/tablet_allocator: Introduce migration_tablet_set
service: Introduce migration_plan::add(migrations_vector)
locator/tablets: Introduce tablet_map::for_each_sibling_tablets()
locator/tablets: Introduce tablet_map::needs_merge()
locator/tablets: Introduce resize_decision::initial_decision()
locator/tablets: Fix return type of three-way comparison operators
service: Extract update of node load on migrations
service: Extract converge check for intra-node migration
service: Extract erase of tablet replicas from candidate list
scripts/tablet-mon: Allow visualization of tablet id
On startup, if a server reads an sstable that belongs to a tablet that
doesn't have any local replica, it throws an error in the following
format and refuses to start :
```
Storage wasn't found for tablet 1 of table test.test
```
This patch updates the code path to throw a nicer error that includes
the sstable name that caused the problem.
This patch also adds a testcase to verify the error being thrown.
Fixes#18038
Although `crc_check_chance` is accepted as a configuration option in ScyllaDB,
the value is currently ignored during runtime. This change makes this behavior
explicit in the documentation to prevent potential user misunderstandings.
Changes:
- Explicitly document that the option is currently a no-op
- Provide clear guidance on the current implementation
- Prevent confusion about the option's actual functionality
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21794
In the current scenario, if during startup, a node crashes after initiating gossip and before joining group0,
then it keeps floating in the gossiper forever because the raft based gossiper purging logic is only effective
once node joins group0. This orphan node hinders the successor node from same ip to join cluster since it collides
with it during gossiper shadow round.
This commit intends to fix this issue by adding a background thread which periodically checks for such orphan entries in
gossiper and removes them.
A test is also added in to verify this logic. This test fails without this background thread enabled, hence
verifying the behavior.
Fixes: scylladb/scylladb#20082Closesscylladb/scylladb#21600
In commit 2596d157, we added a condition to run auto-backport.py only
when the GitHub Action is triggered by a push to the default branch.
However, this introduced an unexpected error due to incorrect condition
handling.
Problem:
- `github.event.before` evaluates to an empty string
- GitHub Actions' single-pass expression evaluation system causes
the step to always execute, regardless of `github.event_name`
Despite GitHub's documentation suggesting that ${{ }} can be omitted,
it recommends using explicit ${{}} expressions for compound conditions.
Changes:
- Use explicit ${{}} expression for compound conditions
- Avoid string interpolation in conditional statements
Root Cause:
The previous implementation failed because of how GitHub Actions
evaluates conditional expressions, leading to an unintended script
execution and a 404 error when attempting to compare commits.
Example Error:
```
python .github/scripts/auto-backport.py --repo scylladb/scylladb --base-branch refs/heads/master --commits ..2b07d93beac7bc83d955dadc20ccc307f13f20b6
shell: /usr/bin/bash -e {0}
env:
DEFAULT_BRANCH: master
GITHUB_TOKEN: ***
Traceback (most recent call last):
File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 201, in <module>
main()
File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 162, in main
commits = repo.compare(start_commit, end_commit).commits
File "/usr/lib/python3/dist-packages/github/Repository.py", line 888, in compare
headers, data = self._requester.requestJsonAndCheck(
File "/usr/lib/python3/dist-packages/github/Requester.py", line 353, in requestJsonAndCheck
return self.__check(
File "/usr/lib/python3/dist-packages/github/Requester.py", line 378, in __check
raise self.__createException(status, responseHeaders, output)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/commits/commits#compare-two-commits", "status": "404"}
```
Fixesscylladb/scylladb#21808
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21809
The migration process is doing read with consistency level ALL,
requiring all nodes to be alive.
Fixesscylladb/scylladb#20754
The PR should be backported to 6.2, this version has view builder on group0.
Closesscylladb/scylladb#21708
* github.com:scylladb/scylladb:
test/topology_custom/test_view_build_status: add reproducer
service/topology_coordinator: migrate view builder only if all nodes are up
This patch reverts 324b3c43c0 and adds synchronous versions of `service_level_controller::find_effective_service_level()` and `client_state::maybe_update_per_service_level_params()`.
It isn't safe to do asynchronous calls in `for_each_gently`, as the
connection may be disconnected while a call in callback preempts.
Fixesscylladb/scylladb#21801Closesscylladb/scylladb#21761
* github.com:scylladb/scylladb:
Revert "generic_server: use async function in `for_each_gently()`"
transport/server: use synchronous calls in `for_each_gently` callback
service/client_state: add synchronous method to update service level params
qos/service_level_controller: add `find_cached_effective_service_level`
Small adjustments and improvements to the documentation in the raft
section.
Fixing Markdown lint warnings:
- MD004/ul-style: Unordered list style [Expected: dash; Actual: asterisk]
- MD007/ul-indent: Unordered list indentation [Expected: 0; Actual: 2]
- MD032/blanks-around-lists: Lists should be surrounded by blank lines
- MD036/no-emphasis-as-heading: Emphasis used instead of a heading
- MD046/code-block-style: Code block style [Expected: fenced; Actual: indented]
Closesscylladb/scylladb#21780
There is no point waiting for a node been replaced to appear in the
gossiper since it either will be there already or it will never appear.
gossiper:is_alive() knows how to handle both of those cases, so just
call it directly.
This reverts commit 324b3c43c0.
It isn't safe to do asynchronous calls in `for_each_gently`, as the
connection may be disconnected while a call in callback preempts.
Fixesscylladb/scylla#21801
The test does the following:
- Enables an error injection which will cause further view updates to
get stuck, occupying space in memory and affecting the backlog,
- Performs a single, large write to the base table which causes a single
view update to be generated; the write is then followed with one more,
small write to make sure that the other write will be affected by the
first write's backlog,
- Reads relevant metrics in order to check the exact value of the delay
that was calculated for the base table write due to MV backpressure.
This is done for different values of the MV delay configuration
parameter (view_flow_control_delay_limit_in_ms) and the calculated
delays are collected into a list. Lastly, the test checks that the
relation between parameter value and the calculated delays is linear.
This series adds WCU support for the delete item operation.
It also splits the Alternator WCU metric by an ops label to give us better visibility of how much each ops contributes to the WCU calculation.
No need to backport to the open source
Closesscylladb/scylladb#21709
* github.com:scylladb/scylladb:
test_returnconsumedcapacity.py: Add delete Item tests
alternator/executor: Add WCU support for delete item
alternator/executer use uint in describe_item
alternator/consumed_capacity.hh: Make the total_bytes public
test_metrics validate split wcu_total to ops
Alternato: split WCU metrics into ops
Information about view update backlog is propagated in two main ways:
- In RPCs that serve as responses to writes (MUTATION_DONE /
MUTATION_FAILED)
- Via gossip (application_state::VIEW_BACKLOG)
In tests, it can be benefical to disable the second mechanism. View
update backlog propagation via write responses happens synchronously
with respect to writes so it is easier to control and reason about,
while gossip is asynchronous and can overwrite the backlog that was
propagated via write responses.
Add `skip_updating_local_backlog_via_view_update_backlog_broker` error
injection which skips the logic that updates the local, per-endpoint
cache of view update backlogs from the gossip state.
Until this patch, the materialized view flow-control algorithm
(https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/)
used a constant delay_limit_us hard-coded to one second, which means
that when the size of view-update backlog reached the maximum (10%
of memory), we delay every request by an additional second - while
smaller amounts of backlog will result in smaller delays.
This hard-coded one maximum second delay was considered *huge* - it will
slow down a client with concurrency 1000 to just 1000 requests per
second - but we already saw some workloads where it was not enough -
such as a test workload running very slow reads at high concurrency
on a slow machine, where a latency of over one second was expected
for each read, so adding a one second latecy for writes wasn't having
any noticable affect on slowing down the client.
So this patch replaces the hard-coded default with a live-updateable
configuration parameter, `view_flow_control_delay_limit_in_ms`, which
defaults to 1000ms as before.
Another useful way in which the new `view_flow_control_delay_limit_in_ms`
can be used is to set it to 0. In that case, the view-update flow
control always adds zero delay, and in effect - does absolutely
nothing. This setting can be used in emergency situations where it
is suspected that the MV flow control is not behaving properly, and
the user wants to disable it.
The new parameter's help string mentions both these use cases of
the parameter.
Fixes#18187
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
this workflow should be triggered either if a push event occurred or
pull_request_target (which mean someone added backport label)
It seems that due to wrong indentation the workflow wasn't trigger
during label add
Fixing it
Closesscylladb/scylladb#21791
And split it into two -- one for materialized view, another for
secondary index. This is to fit current cqlpy layout that has different
files for views and indexes.
refs: #21552
refs: #21551 (detached this patch from there, as that PR needs fix in
the core code)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21677
In this PR, we get rid of the unnecessary default switch case in `prefix_for_scheme()`. We also change the return type of the function to `std::string_view` as it's easier to operate on.
Backport: not needed; this is a code cleanup.
Closesscylladb/scylladb#21749
* github.com:scylladb/scylladb:
auth/passwords: Change return type of prefix_for_scheme to std::string_view
auth/passwords.cc: Remove default case in prefix_for_scheme()
This commit follows up on commit f436edfa22, which initially cleaned up
unused #include directives in the "mutation" subdirectory. This change
removes additional unused header files that were missed in the previous
cleanup.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21740
Introduce a new GitHub workflow to run shellcheck on changed shell
scripts. This workflow automatically detect and highlight potential
shell script issues in pull requests. This change is a follow-up to
commit 0700b322 which fixed an undefined variable issue in `install.sh`.
It intends to leverage static analysis to improve script quality and
catch potential errors early.
Shellcheck will now:
- Analyze all shell scripts modified in pull requests
- Provide inline comments with specific issue details
- Help prevent similar variable-related mistakes in the future
See also
https://github.com/redhat-plumbers-in-action/differential-shellcheck
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21755
This change introduces a new truncate_with_tablets RPC with a parameter
of type service::frozen_topology_guard. This is materialized on replica
nodes into a topology_guard which guarantees that truncate is performed
under a global session, which, in turn, makes sure that we don't execute
truncate as a result of stale RPCs.
Also, this RPC does not have a timeout. Timeout will be handled on the
coordinator side, and the truncate operation will not be allowed to time
out.
This patch adds a feature serive which protects the system.topology
schema change against situations where clusters are incompletely
upgraded to new a version and could be rolled back.
This commit makes storage_proxy::remote dependent on raft_group0_client
and topology_state_machine. storage_proxy::remote gets references to these via
the call to start_remote(). These references will be needed to call
storage_service::truncate_table_with_tablets().
With commits ed7d352e7d and bb1867c7c7, we now have input streams for both compressed and uncompressed SSTables that provide seamless checksum and digest checking. The code for these was based on `validate_checksums()`, which implements its own validation logic over raw streams. This has led to some duplicate code.
This PR deduplicates the uncompressed case by modifying `validate_checksums()` to use a checksummed input stream instead of a raw stream. The same cannot be done for compressed SSTables though. The reason is that `validate_checksums()` needs to examine the whole data file, even if an invalid chunk is encountered. In the checksummed case we support that by offloading the error handling logic from the data source via a function parameter. In the compressed data source we cannot do that because it needs to return decompressed data and decompression may fail if the data are invalid.
This PR also enables `validate_checksums()` to partially verify SSTables with just the per-chunk checksums if the digest is missing.
In more detail, this PR consists of:
* Port of some integrity checks from `do_validate_uncompressed()` to the checksummed data source. It should now be able to detect corruption due to truncated or appended chunks (expected number of chunks is retrieved from the CRC component).
* Introduction of `error_handler` parameter in checksummed data source and `data_stream()`.
* Refactoring of `validate_checksums()`. The JSON response of `sstable validate-checksums` was also modified to report a missing digest.
* Tests for `validate_checksums()` against SSTables with truncated data, appended data, invalid digests, or no digest.
Refs #19058.
This PR is a hybrid of cleanup and feature. No backport is needed.
Closesscylladb/scylladb#20933
* github.com:scylladb/scylladb:
tools/scylla-sstable: Rename valid_checksums -> valid
test: Check validate_checksums() with missing digest
sstables: Allow validate_checksums() to report missing digests
sstables: Refactor validate_checksums() to use checksummed data stream
sstables: Add error_handler parameter to data_stream()
sstables: Add error handler in checksummed data source
sstables: Check for excessive chunks in checksummed data source
sstables: Check for premature EOF in checksummed data source
test: test_validate_checksums: Check SSTable with invalid digest
test: test_validate_checksums: Check SSTable with appended data
test: test_validate_checksums: Complement test for truncated SSTable
Make use of the recently introduced reserve_io_control_blocks to ensure
some reserve IOCBs are left for scylla-nodetool or any other native tool
that might be running intermittently next to ScyllaDB.
These tool apps use the io_uring reactor backend, which requires just 2
IOCBs to function, so the configured default reserve of 10 is good for
running 5 instances of these tools next to ScyllaDB, which should be
good enough.
To be used by the tool apps -- also change the backend selected in
tools::utils::configure_tool_mode().
We keep using the more mature AIO backend in ScyllaDB itself, so main.cc
sets the linux_aio backend as the default one (the user can still change
this, same as before).
Instead of the legacy app_template::config. This allows for greater
flexibility, as any option's default can be changed this way, not just
those few that are promoted to app_template::config. This will be made
use of in the next patches.
It might happen sleep will fail during shutdown, so we should handle
failure for shutdown to proceed gracefully.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This implements the ability for the balancer to co-locate sibling
tablets on the same shard. Co-location is low in priority, so
regular load balancer is preferred over it. Previous changes
allowed balancer to move co-located sibling tablets together,
to not undo the co-location work done so far.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The reason we need it is that tablet merge can only be finalized
when the cluster agrees on the feature, otherwise unpatched
nodes would fail to handle merge finalization, potentially
crashing.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
set_resize_plan() breaks commutativity since it may override the
resize plans done earlier, for example, when adding co-location
migrations in the DC plan.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When handling merge completion, compaction groups that belonged to
sibling tablets are placed into the same storage group, since those
tablets become one after merge.
In order to merge two groups, the source group needs its memtable to
be flushed first, such that all the data can be moved into the
destination.
The handling happens in update_effective_replication_map() which cannot
afford to wait for I/O, so the group merge will happen in background.
There's a fiber that will wake up on merge completion and will iterate
through the new set of storage groups (after merge), and will work
on merging additional compaction groups into the main one.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This implements the ability to resize the tablet map for merge if
the balancer emits the decision to finalize the merge when all
sibling replicas are colocated for a table. But the co-location
plan is not implemented in the balancer yet, so this is still
not in use.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This transition state will be reused by merge completion, so let's
rename it to tablet_resize_finalization.
The completion handling path will also be reused, so let's rename
functions involved similarly.
The old name "tablet split finalization" is deprecated but still
recognized and points to the correct transition. Otherwise, the
reverse lookup would fail when populating topology system table
which last state was split finalization.
NOTE:
I thought of adding a new tablet_merge_finalization, but it would
complicate things since more than one table could be ready for
either split or merge, so you need a generic transition state
for handling resize completion.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The initial_tablet_count is respected while the table is in "growing mode".
The table implicitly enters this mode when created, since we expect the
table to be populated thereafter.
We say that a table leaves this mode if it required a split above the initial
tablet count. After that, we can rely purely on the average size to say that
a table is shrinking and requires merge.
This is not perfect and we may want to leave the mode too if we detect
the table is shrinking (or even not growing for some significant amount
of time), before any split happened.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If table is undergoing merge, co-located replicas of sibling tablets will
be treated by balancer as if they were a single migration candidate.
The reason for that is that the balancer must not undo the co-location work
done previously on behalf of merge decision.
Sibling tablets will be put in the same migration plan, but note that
each tablet is still migrated independently in the state machine.
The balancer will exclude both co-located tablets from the candidate list
if either haven't finished migration yet. It achieves that by pretending
migration of sibling tablets succeeded, allowing it to note that tablets
are co-located even though either can still be migrating.
The load inversion convergence check also happens after picking a
candidate now, since the balancer must be aware that co-located
tablets are being migrated together and we want to avoid oscillations.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We have a check that moving a tablet from A to B won't violate
replication constraints.
The contraints might not be the same for two sibling tablets that have co-located
replicas.
Example:
nodes = {A, B, C, D}
tablet1 = {A, B, C}
tablet2 = {A, B, D}
viable target for {tablet1, B} is D.
viable target for {tablet2, B} is C.
When co-located replicas share a viable target, then a migration can be emitted to
preserve co-location.
To allow decommission when co-located replicas don't share a viable target, a skip
info will be returned for each tablet, even though that means breaking this
co-location. Decommission is higher in priority.
Also, doing some preparation for integration of migration_tablet_set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The load inversion convergence check should be able to know when two
tablets are being migrated instead of one, to avoid oscillations.
This will be wired when migration_tablet_set is wired.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new type will allow the load balancer to treat co-located tablets
as a single candidate (will treat them as if they were already merged),
allowing co-located replicas to be migrated together (in the same
migration plan).
The type is a variant of global_tablet_id and colocated_tablets
(which holds the global_tablet_id of the sibling tablets).
It will be eventually wired after some more preparation. It will allow
for minimal amount of changes in the balancer code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Adding interface to iterate through sibling tablets for a given table,
one pair at a time.
Initially I thought of having for_each_sibling_tablet do nothing for single
tablet tables. But later I bumped into complications when wiring it into
load balancer for building candidate list, since single-tablet tables
have to be special cased.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Know whether resize (e.g. split) decision was needed above initial tablet
count will be helpful for guiding the merge decision, since we don't
want a merge to happen while table is still growing, but hasn't left
the merge threshold yet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This extraction will make it easier later when co-located tablets
are introduced in load balancer.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That will help visualizing co-location of sibling tablets for a table
that is undergoing merge.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"
This rather large patch series moves storage proxy and some adjacent
services (like migration manager) to use host ids to identify nodes rather
than ips. Messaging service gains a capability to address nodes by host
ids (which allows dropping translations from topology coordinator code
that worked on host ids already) and also makes sure that a node with
incorrect host id will reject a message (can happen during address
changes).
The series gets rid of the raft address map completely and replaces it with
the gossiper address map which is managed by the gossiper since translation
is now done in the layer below raft.
Fixes: scylladb/scylladb#6403
perf-simple-query -- smp 1 -m 1G output
Before:
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
64336.82 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41291 insns/op, 24485 cycles/op, 0 errors)
62669.58 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41277 insns/op, 24695 cycles/op, 0 errors)
69172.12 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41326 insns/op, 24463 cycles/op, 0 errors)
56706.60 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41143 insns/op, 24513 cycles/op, 0 errors)
56416.65 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41186 insns/op, 24851 cycles/op, 0 errors)
throughput: mean=61860.35 standard-deviation=5395.48 median=62669.58 median-absolute-deviation=5153.75 maximum=69172.12 minimum=56416.65
instructions_per_op: mean=41244.62 standard-deviation=76.90 median=41276.94 median-absolute-deviation=58.55 maximum=41326.19 minimum=41142.80
cpu_cycles_per_op: mean=24601.35 standard-deviation=167.39 median=24512.64 median-absolute-deviation=116.65 maximum=24851.45 minimum=24462.70
After:
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
65237.35 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 40733 insns/op, 23145 cycles/op, 0 errors)
59283.09 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40624 insns/op, 23948 cycles/op, 0 errors)
70851.03 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40625 insns/op, 23027 cycles/op, 0 errors)
70549.61 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40650 insns/op, 23266 cycles/op, 0 errors)
68634.96 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40622 insns/op, 22935 cycles/op, 0 errors)
throughput: mean=66911.21 standard-deviation=4814.60 median=68634.96 median-absolute-deviation=3638.40 maximum=70851.03 minimum=59283.09
instructions_per_op: mean=40650.89 standard-deviation=47.55 median=40624.60 median-absolute-deviation=27.11 maximum=40733.37 minimum=40622.33
cpu_cycles_per_op: mean=23264.16 standard-deviation=402.12 median=23145.29 median-absolute-deviation=237.63 maximum=23947.96 minimum=22934.59
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13531/
SCT (longevity-100gb-4h with nemesis_selector: ['topology_changes']): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/gleb/job/move-to-host-id/3/
Tested mixed cluster manually.
"
* 'gleb/move-to-host-id-v2' of github.com:scylladb/scylla-dev: (55 commits)
group0: drop unused field from replace_info struct
test: rename raft_address_map_test to address_map_test and move if from raft tests
raft_address_map: remove raft address map
topology coordinator: do not modify expire state for left/new nodes any more in raft address map
topology coordinator: drop expiring entries in gossiper address map on error injections since raft one is no longer used
group0: drop raft address map dependency from raft_rpc
group0: move raft_ticker_type definition from raft_address_map.hh
storage_service: do not update raft address map on gossiper events
group0: drop raft address map dependency from raft_server_with_timeouts
group0: move group0 upgrade code to host ids
repair: drop raft address map dependency
group0: remove unused raft address map getter from raft_group0
group0: drop raft address map from group0_state_machine dependency since it is not used there any more
group0: remove dependency on raft address map from group0_state_id_handler
gossiper: add get_application_state_ptr that searches by host_id
gossiper: change get_live_token_owners to return host ids
view: move view building to host id
hints: use host id to send hints
storage_proxy: remove id_vector_to_addr since it is no longer used
db: consistency_level: change is_sufficient_live_nodes to work on host ids
...
This patch adds three basic tests for delete item. A simple one that
validate that a simple short delete item returns 1 WCU.
The second tries to delete a missing item.
The third stores a bigger item and use the ReturnValues='ALL_OLD' to
make the API gets the previous stored item and see that the WCU is as
expected.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Calculating the item length of WCU deleted Item depends on how the
operations was performed.
In a simple scenario it would be consider a 1 byte.
With an unsafe Read-Before-Write the item is return by get_perious_item
and with LWT the item is get from the apply method.
This patch changes the calls to describe_single_item in the last two
scenarios so that they would use the read item to determine the item
length.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Actions in rmw_operation can use describe_item to determine to get an
existing value (Read before Write scenario) on those cases the existing
item size can be bigger than the one we are storing (in the extreme
case, when deleting an object we only have its keys)
This modify the describe_item API so it would take a pointer to uint
instead of the consumed_capacity_counter so we can use it to get the old
value size and depends on that, determine the size that will be used for
the WCU calculation.
rmw operations needs to be able to modify consume_capacity total_bytes
directly.
Depends on the previous stored item the length on which the WCU will be
calculated can be different than the length of the operation.
This patch makes the total_bytes public so it will be possible to modify
it directly.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch modify the post_item WCU test to validate that it uses the
right ops. Note that the test will pass even before this change but we
want to validate the extra label.
This patch add visibility to the WCU metrics. It uses a label 'ops' to
split each of the operations that contribute to WCU into their
operations.
When summing over all ops value the result will be the same.
Currently, the tri-compare operator for big_decimal (operator <=>), uses
a precise but potentially very expensive algorithm for comparing the
numbers: it first brings them to the same scale, then compares the
normalized unscaled values. big_decimal has abritrary precisions,
therefore the stored numbers can be arbitrarily large.
In extreme cases, comparing two numbers can result in huge amount of
memory allocated and stalls. If this type is used int he primary key of
a table, these comparisons can make the node completely unresponsive.
This patch adds the following fast-paths to operator <=>:
* An early return for the case of equal scales.
* An early return for different signs.
* An early return for the case where one or both of the numbers are 0.
* A fast algorithm for detecting the case where the there is a big
difference between the two numbers. This algorithm works only with the
scales and is able to compare the two numbers by using only one division
and some additions and substractions. This algorithm is imprecise and
when the numbers are closer than its confidence window, it will
fall-back to the current slow but precise tri-compare.
All but the last case should have been fast before as well, but the
scale-compare algorithm makes a huge difference. Numbers, which would
previously make the node unresponsive, now compare in constant-time.
Fixes: scylladb/scylladb#21716Closesscylladb/scylladb#21715
Topology request table may change between the code reading it and
calling to cv::when() since reading is a preemption point. In this
case cv:signal can be missed. Detect that there was no signal in between
reading and waiting by introducing reload_count which is increased each
time the state is reloaded and signaled. If the counter is different
before and after reading the state may have change so re-check it again
instead of sleeping.
Closesscylladb/scylladb#21713
* github.com:scylladb/scylladb:
topology_coordinator: introduce reload_count in topology state and use it to prevent race
storage_service: use conditional_variable::when in co-routines consistently
Similarly to `maybe_update_per_service_level_params`, the method update
connection's params but it gets `service_level_options` as an argument
instead of asking `service_level_controller`.
The method is a synchronous equivalent of
`find_effective_service_level`.
It uses recently introduced effective service level cache, so retrieve
user's effective service level is done by quick lookup to the cache.
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21754
now that we are allowed to use C++23. we now have the luxury of using
`std::views::transform`.
in this change, we:
- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
is not applicable to a range view returned by `std::view::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
`std::view::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
range returned by `std::view::transform`
- use `std::ranges::min()` to get the minimal element in the range
returned by `std::view::transform`
- use `std::ranges::equal()` to compare the range views returned
by `std::view::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
to feed `std::views::transform()` a view range.
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
limitations:
there are still a couple places where we are still using
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21700
This commit addresses inconsistent spelling annotations that triggered
codespell warnings in our codebase.
Problem:
- Previous annotations like "CREATEing" and "DROPing" were flagged as
misspellings by the codespell workflow
- These annotations were used to describe CQL statement execution contexts
Solution:
- Updated annotations to "CREAT'ing" and "DROP'ing"
- Preserves the intent of the original annotations
- Silences codespell warnings without changing the underlying meaning
- Ensures consistent and spell-checker-friendly code documentation
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21741
Add tablet task manager module and keep it in storage_service.
Introduce tablet_virtual_task that covers tablet repair.
Thanks to a repair virtual task, a user can check the list of pending
repairs, get the status of a specific repair, or abort it using the task
manager API.
Fixes: #21368.
No backport, new feature
Closesscylladb/scylladb#21624
* github.com:scylladb/scylladb:
test: add test to check tablet repair tasks
test: topology_tasks: enable tablets
service: keep tablets module in storage_service
service: rename storage_service::_task_manager_module
service: add tablet_virtual_task
tasks: utilize preliminary virtual task lookup
In commit 8bf62a0 we introduced a test/pytest.ini which affects every
run of pytest in the project. One specific line in that file
log_cli = true
Overrides pytest's standard CLI output, which is traditionally short
unless the "-v" (verbose) option is used, to be always long and spammy.
There is absolutely no reason to do that - if the user wants to run
"pytest -v", they can do that - it doesn't need to be the default.
Moreover, as https://docs.pytest.org/en/stable/how-to/logging.html
explains, the "log_cli = true" was added in pytest 3.4 to revert to
pytest 3.3 behavior that "community feedback" showed was NOT LIKED.
Why would we want to revert to behavior that wasn't liked?
After this patch, which removes that line, the output of commands
like
cd test/cqlpy; pytest
return to what they used to be before commit 8bf62a0 and what the
pytest developers intended. Users who like verbose output can use
"pytest -v".
Fixes#21712
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21717
$without_systemd_check is incorrect variable name, it should be
$skip_systemd_check.
The bug skips to run "systemctl --user daemon-reload" unexpectedly on
nonroot mode installation.
This is likely root cause of the issue #21720.
Fixes#21720Closesscylladb/scylladb#21747
We get rid of the default switch case in the function because it's not
necessary. It's better to get a warning from the compiler if the switch
is nonexhaustive and possibly prevent a bug (operating on a null pointer
may often lead to undefined behavior).
Fixes#20717
Enables abortable interface and propagates abort_source to
all s3 objects used for reading the restore data.
Note: because restore is done on each shard, we have to maintain
a per-shard abort source proxy for each, and do a background
per-shard abort on abort call. This is synced at the end of "run()"
v2:
* Simplify abortability by using a function-local gate instead.
Adds optional abortable source to "readable_file" interface.
Note: the abortable aspect is not preserved across a "dup()" call
however, since these objects are generally not used in a cross-shard
fashion, it should be ok.
Scrub compaction can pick up input sstables from maintenance sstable set
but on compaction completion, it doesn't update the maintenance set
leaving the original sstable in set after it has been scrubbed. To fix
this, on compaction completion has to update the maintenance sstable if
the input originated from there. This PR solves the issue by updating the
correct sstable_sets on compaction completion.
Fixes#20030
This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.
Closesscylladb/scylladb#21582
* github.com:scylladb/scylladb:
compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
compaction_group: rename `update_main_sstable_list_on_compaction_completion`
compaction_group: update maintenance sstable set on scrub compaction completion
compaction_group: store table::sstable_list_builder::result in replacement_desc
table::sstable_list_builder: remove old sstables only from current list
table::sstable_list_builder: return removed sstables from build_new_list
In commit 9ff9cd37c3 we added in
test/alternator/test_number.py a workaround for a boto3 bug that
prevented us (and still prevents us) from testing numbers with high
precision. Because the workaround was so bizarre, the three lines it
requires - two imports and an assignment - were preceded by a 5-line
comment explaining it.
Unfortunately, a later commit 93b9b85c12
went and arbitrarily moved import lines around to satisfy some PEP-8
"requirements", resulting in the comment being separated from the lines
it was supposed to explain.
This patch moves the comment in front of the main line it explains.
The two imports that are needed just for this line and aren't used
elsewhere remain in their current place (where the PEP8 police demands
they stay), but this is less important for the understanding of this
trick so it's fine.
No functionality of the test was changed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21635
In the current scenario, 'test_replace_with_encryption' only confirms the replacement with inter-dc encryption
for normal nodes. This commit increases the coverage of test by parametrizing the test to confirm behavior
for zero token node replacement as well. This test also implicitly provides
coverage for bootstrap with encryption of zero token nodes.
This PR increases coverage for existing code. Hence we need to backport it. Since only 6.2 version has zero
token node support, hence we only backport it to 6.2
Fixes: scylladb/scylladb#21096Closesscylladb/scylladb#21609
Raft address map is not use any longer to resolve addresses anyway, so
drop dependency on it from raft_ip_address_updater and rename it to
reflect that it is no longer raft address map specific.
It is only needed to translate id to ip in the log output, but there
is no point in doing so now. All the logging (in the converted code)
is id based now.
RPCs from old nodes will still use old format so translation will be
used in this case. The change is backwards compatible thanks to RPC
extensibility.
Also remove forcing of the replacing node to be alive which is not
needed any more since gossiper no longer inhibits replacing nodes from
advertising themselves.
The parameter was needed when nodes were addressed by IP, so during
replace with the same IP a new node had to "hide" itself from the
cluster to not get accidentally confused with the old node. Now,
when nodes are addressed by host id the situation is impossible.
In this rather large path we mode to address nodes in storage proxy by
host ids instead of ips. Some subsystems storage proxy calls to are
not yet converted to host ids, so we translate back and forth when we
interact with them.
Currently the locator::topology object, when created, does not contain
local node, but it is started to be used to access local database. It
sort of work now because there are explicit checks in the code to handle
this special case like in topology::get_location for instance. We do not
want to hack around it and instead rely on an invariant that the local
node is always there. To do that we add local node during
locator::topology creation. There is a catch though. Unlike with IP host
ID is not known during startup. We actually need to read from the
database to know it, so the topology starts with host ID zero and then
it changes once to the real one. This is not a problem though. As long as
the (one node) topology is consistent (_cfg.this_host_id is equal to the
node's id) local access will work.
Local replication strategy returns zero host id in replica set instead
of the real one. It mostly works now because code that translates ids
to ips knows that zero host id is a special one. But we want to use host
ids directly and we need to return real one (or handle zero special case
everywhere).
What wait_for_ip is actually does is waiting for a node to appear in the
gossiper since this is when it is added to the raft address map. Drop
the usage of the address map and check the gossiper directly.
Now we have enough functionality in the gossiper and messaging service
to get rid of ip2id function in the topology coordinator. We can use
hos ids directly.
If an RPC client creation was triggered by send function that has host
id as a dst send it as part of CLIENT_ID RPC which is always the first
RPC on each connection. If receiver's host id does not match it will
drop the connection.
Since we dropped scylla-jmx at 3cd2a61, Wants=scylla-jmx.service is not
needed anymore.
Also we have issue on nonroot mode installation with this line (#21720),
we need to drop this now.
Fixes#21720Closesscylladb/scylladb#21721
The function checks if a node with provided id is alive. If it fails to map id to ip
or there is no state for the ip found the node is considered to be dead.
The function looks up provided host id in gossip_address_map and throws
unknown_address if the mapping is not available. Otherwise it sends the
message by IP found.
We want to be able to address nodes by host ids. For that lets generate
send functions that gets host_id as a dst parameter.
Changes to raft_rpc are needed because otherwise the compiler cannot
select a correct overload.
Topology request table may change between the code reading it and
calling to cv::when() since reading is a preemption point. In this
case cv:signal can be missed. Detect that there was no signal in between
reading and waiting by introducing reload_count which is increased each
time the state is reloaded and signaled. If the counter is different
before and after reading the state may have change so re-check it again
instead of sleeping.
Fixes: scylladb/scylladb#19994
Building upon commit 69b47694, this change addresses a subtle synchronization
weakness in node visibility checks during recovery mode testing.
Previous Approach:
- Waited only for the first node to see its peers
- Insufficient to guarantee full cluster consistency
Current Solution:
1. Implement comprehensive node visibility verification
2. Ensure all nodes mutually recognize each other
3. Prevent potential schema propagation race conditions
Key Improvements:
- Robust cluster state validation before keyspace creation
- Eliminate partial visibility scenarios
Fixesscylladb/scylladb#21724
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21726
Task status information from nodetool commands is not retained permanently:
- Status of completed tasks is only kept for `task_ttl_in_seconds`
- Status is removed after being queried, making it a one-time operation
This behavior is important for users to understand since subsequent
queries for the same completed task will not return any information.
Add documentation to make this clear to users.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21386
these unused includes are identified by clang-include-cleaner. after auditing the source files, all of the reports have been confirmed.
please note, because `mutation/mutation.hh` does not include `seastar/coroutine/maybe_yield.hh` anymore, and quite a few source files were relying on this header to bring in the declaration of `maybe_yield()`, we have to include this header in the places where this symbol is used. the same applies to `seastar/core/when_all.hh`.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#21727
* github.com:scylladb/scylladb:
.github: add "mutation" to CLEANER_DIR
mutation: remove unused "#include"s
in order to prevent future inclusion of unused headers, let's include
"mutation" subdirectory to CLEANER_DIR, so that this workflow can
identify the regressions in future.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
please note, because `mutation/mutation.hh` does not include
`seastar/coroutine/maybe_yield.hh` anymore, and quite a few source
files were relying on this header to bring in the declaration of
`maybe_yield()`, we have to include this header in the places where
this symbol is used. the same applies to `seastar/core/when_all.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Demote --scylla-data-dir and --scylla-yaml-file to schema source
helpers, rather than schema source in themselves. This practically means
that when these options are used, they won't define where the tool will
attempt to load the schema from, they will just be helpers to help locate
the schema, for whichever schema source the tool was instructed to use
(or left to choose).
--scylla-data-dir and --scylla-yaml-file being schema sources were
problematic with encryption at rest and for S3 support (not yet
implemented). With encryption, the tool needs access to the
configuration, so --scylla-yaml-file is often used to provide the path
to the configuration file, which contains encryption configuration,
needed for the tool to decrypt the sstable. Currently, using this option
implies forcing the tool to read the schema from the schema tables,
which is a problematic option for tests -- Scylla might be compacting a
schema sstable and this will make the tool fail to load the schema.
Demoting these options the schema helpers, allows providing them, while
at the same time having the option to use a different schema-source.
To allow the user to force the tool to load the schema from the schema
tables, a new --schema-tables option is added. Similarly, a
--sstable-schema option is introduced to force the tool to load the
schema from the sstable itself.
With this, each 4 schema source now has an option to force the use of
said schema source. There are various helper options to be used along
with these.
The documentation as well as the tests are updated with the changes.
The schema related documentation gets an rather extensive facelift
because it was a bit out-of-date and incomplete.
Fixes: scylladb/scylladb#20534Closesscylladb/scylladb#21678
This change improves dependency management by explicitly specifying
library linkage visibility in CMake targets.
Previously, some ScyllaDB targets used `target_link_libraries()`
without `PUBLIC` or `PRIVATE` keywords, which resulted in transitive
library dependencies by default. This unintentionally exposed
non-public dependencies to downstream targets.
Changes:
- Always use explicit `PRIVATE` or `PUBLIC` keywords with
`target_link_libraries()`
- Tighten build dependency tree
- Enforce a more modular linkage model
See: [CMake documentation on library dependencies](https://cmake.org/cmake/help/latest/command/target_link_libraries.html)
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21686
This commit fixes a problem with way truncate saves commit log
replay positions. On shards without mutations, truncate would save the
replay position into system.truncated with shard number 0 regardless of
the actual shard number that the replay position was saved for.
Once e.g. `ALTER KEYSPACE` is performed, all in-memory objects should be updated accordingly, but this is not entirely true for keyspace metadata object. The reason for that is that keyspace metadata are stored in 2 system tables: `system_schema.keyspaces` and `system_schema.scylla_keyspaces`. Up until now the in-memory keyspace metadata object has been updated only with entries from the first table, and missed updates when entries from the 2nd table changed. These entries were e.g. initial tablets or storage options.
This change fixes this oversight by considering both tables when checking if keyspace metadata need to be updated. From the implementation point of view, the change is simple: we're considering `system_schema.scylla_keyspaces` also in `merge_keyspaces()` and if old and new schemas have any differences, we include that when altering ks.
Fixes#20768
Backport: no need, I don't think the issue is severe, atm it seems like it can only influence the tablets number, which should not bring the cluster down nor result in returning bad data, it can mostly influence the speed of the db.
Closesscylladb/scylladb#20852
The migration process is doing read with consistency level ALL,
requiring all nodes to be alive.
This patch also adds the topology state machine notification when a node
is up.
The checksummed file data source uses the chunk size to enforce that the
reads from the underlying file input stream will be aligned at the chunk
boundary. This is necessary so that we can validate the checksum of each
chunk.
However, a mismatch in the numeric types caused a bug where the
underlying file input stream would read a smaller portion of the data
file than expected.
The bug is located in the following lines:
```
auto start = _beg_pos & ~(chunk_size - 1);
auto end = (_end_pos & ~(chunk_size - 1)) + chunk_size;
```
`_beg_pos` and `_end_pos` are `uint64_t`, whereas `chunk_size` is
`uint32_t`. When executing the AND operation, the compiler converts the
right operand from `uint32_t` to `uint64_t`. Since the integer is
unsigned, the four most-significant bytes are filled with zeros, thus
erroneously truncating the corresponding bytes of the position.
Fix the bug by explicitly converting the chunk size to `uint64_t` before
any arithmetic operations. Also, replace the handwritten alignment
implementations with the `align_up()` and `align_down()` helpers.
Finally, restrict the file end position to not exceed the file length.
Since the last chunk can be smaller than the chunk size, it could happen
that the end position exceeds the file length after the round-up. This
is not a bug on its own since `make_file_input_stream()` can accept
lengths that go beyond end-of-file, but still it makes the code more
error prone and should be avoided.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#21665
Tablets are no longer an experimental feature, but topology_tasks test
suite treats them as if they were.
Enable tablets with their own config option in topology_tasks suite.
Rename storage_service::_task_manager_module to _node_ops_module.
In the following patches, storage service will keep two different
task manager modules.
Before these changes, we dereferenced `app_state` in
`manager::endpoint_downtime_not_bigger_than()` before checking that it's
not a null pointer. We fix that.
Fixesscylladb/scylladb#21699Closesscylladb/scylladb#21676
When API user requests status of a virtual task, we first need to find
which virtual_task instance tracks given operation. While doing this we
gather some info regarding the task, but we don't utilize it.
Add virtual_task_hint that keeps info that was gathered during virtual
task lookup and pass it to virtual_task's methods so the info doesn't
need to be retrieved twice.
Previously, the progress of download_task_impl launched by the "restore" API
was not tracked. Since restore operations can involve large data transfers,
this makes it difficult for users to monitor progress.
The restore process happens in two sequential steps:
1. Open specified SSTables from object storage
2. Download and stream mutation fragments from the opened SSTables to
mapped destinations
While both steps contribute to overall progress, they use different units
of measurement, making a unified progress metric challenging. Because
the load-and-stream step (step 2) is the largest time-consuming part of the
restore. This change implements progress tracking for this step as an
initial improvement to provide users with partial visibility into the
restore operation.
Fixesscylladb/scylladb#21427
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
---
this a part of experimental feature, hence no need to backport.
Closesscylladb/scylladb#21562
* github.com:scylladb/scylladb:
test/object_store: Enable tablets to match production settings
sstables_loader: Track download progress of download_task_impl
sstables_loader: improve batch tracking using ranges library
sstables_loader: print streaming progress with moving range
sstables_loader: mark sstable_streamer::stream_sstable_mutations() private
sstables_loader: fix indentation in stream_sstable_mutations()
The "--use-cmake" option currently hardwires the build directory as
"$source_dir/build". Adhere to the "--build-dir" option's argument
instead:
- If the option is not specified, its argument defaults to "build"; thus,
there is no change in behavior.
- If the option specifies a relative pathname, append it to $source_dir.
- If the option specifies an absolute pathname, use it as-is.
This is especially useful for keeping the build directory on a filesystem
separate from the source directory (without resorting to creating "build"
as a symlink, before running "configure.py"). For example, the source tree
can be accessed remotely over sshfs, from a build host, while keeping the
build artifacts (and hence the link stage) local to the build host.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Closesscylladb/scylladb#21694
tombstone_gc.hh is relatively lightweight and is used in many places,
but it includes the heavyweight boost/icl/interval_map.hh. Lighten
the load for its users by wrapping lw_shared_ptr<some icl map type>
in a forward-declared class. Define the class in a new header
tombstone_gc-internals.hh, to be used by the two translation units
that need it.
Ref #1.
Closesscylladb/scylladb#21706
Update the tablestats documentation to correctly describe the "Number of
partitions" metric. The previous documentation incorrectly referred to
"estimated row count" when the command actually shows estimated partition count.
Before:
```
Number of keys (estimate) | The estimated row count
```
After:
```
Number of partitions (estimate) | The estimated partition count
```
This distinction is important since a partition (identified by its partition
key) can contain multiple rows in ScyllaDB. The updated format also matches
Cassandra's nodetool output for better compatibility.
Fixesscylladb/scylladb#21586
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21598
Now that `update_sstable_sets_on_compaction_completion` can update both
the main and maintenance sets, callers of
`update_sstable_lists_on_off_strategy_completion` can replace it with
the former.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Rename `update_main_sstable_list_on_compaction_completion` to
`update_sstable_sets_on_compaction_completion` as the method updates
both main and maintenance sstable sets now.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Scrub compaction can pick up input sstables from maintenance sstable set
but on compaction completion, it doesn't update the maintenance set
leaving the original sstable in set after it has been scrubbed. To fix
this, on compaction completion has to update the maintenance sstable if
the input originated from there.
This patch modifies the `update_sstable_sets_on_compaction_completion`
to remove the input sstable from the maintenance sstable set if it
exists in that set.
Also added a testcase to verify the fix.
Fixes#20030
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Directly store the result of `build_new_list` in `replacement_desc`
instead of storing just the newly built sstable_set. Adjust the
`backlog_tracker_adjust_charges` to use the removed sstables list
returned by the `build_new_list`, so that when the next patch updates
the `update_main_sstable_list_on_compaction_completion` to also update
the maintenance sstable set, only sstables removed from main sstable set
will be removed from the backlog tracker.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The `build_new_list()` method previously joined the current and new
sstable ranges, removing old sstables from the combined result. This
patch updates the method to treat them separately, ensuring old sstables
are removed only from the current sstable list.
This change enables the method to return the correct set of removed
sstables in cases where an sstable is directly moved from the
maintenance set to the main set.
Updated the method table::sstable_list_builder::build_new_list() to
return the list of sstables that was removed along with the newly built
sstable set. This change will be used to unify the
`update_sstable_lists` variants in a following patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Enable the `enable_tablets` configuration flag in object store tests to better
align with production environments, where it is enabled by default via the
`scylla.yaml` in Scylla's relocatable tarball. This change will improve test
coverage of tablet-related features.
Previously, `enable_tablets` defaulted to false in tests, creating a mismatch
with typical production deployments.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Previously, the progress of download_task_impl launched by the "restore" API
was not tracked. Since restore operations can involve large data transfers,
this makes it difficult for users to monitor progress.
The restore process happens in two sequential steps:
1. Open specified SSTables from object storage
2. Download and stream mutation fragments from the opened SSTables to
mapped destinations
While both steps contribute to overall progress, they use different units
of measurement, making a unified progress metric challenging. Because
the load-and-stream step (step 2) is the largest time-consuming part of the
restore. This change implements progress tracking for this step as an
initial improvement to provide users with partial visibility into the
restore operation.
Fixesscylladb/scylladb#21427
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Replace manual vector iteration with ranges library to preserve batch size
information. When streaming SSTable mutations, we need to track progress
across batches. The previous implementation used a loop to move elements
from the vector's end, but this approach lost the batch size information
since the SSTable set was moved away during streaming.
Now use std::ranges to take elements from the vector's end instead of
manual iteration. This preserves the original batch size, enabling accurate
progress tracking which will be implemented in a follow-up commit.
Technical changes:
- Replace manual vector iteration with ranges::take_view
- Preserve batch size information for progress tracking
- Maintain existing batch processing behavior
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After d1db17d490, the processed SSTable counter remained at 0 while
streaming progress was still being displayed. This fix properly tracks
and displays streaming progress by:
- Moving SSTable counter (`nr_sst_current`) to `sstable_streamer::stream_sstables()`
- Generating UUID at the streaming initialization
- Relocating progress reporting to `stream_sstables()` for accurate tracking
This ensures the progress indicator correctly reflects the actual number
of processed SSTables during streaming operations.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the only user of `sstable_streamer::stream_sstable_mutations()` is
`sstable_st6reamer::stream_sstables()`, so mark this member function as
private.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Fix indentation regression from d1db17d490 where the function body of
`sstable_streamer::stream_sstable_mutations()` was left incorrectly
indented after the function was extracted to decouple streaming from
sstable selection.
Pure style fix, no functional changes.
in this change, we correct the indent.
Refs d1db17d490
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`coroutine::parallel_for_each` accepts both a range and a pair of
iterators. let's use the former when appropriate. it is simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21684
ScyllaDB doesn't support counters with tablets yet. So scylla-sstable
tests which use counter schema are marked with xfail, but this is done
too aggressively, disabling too many tests that are otherwise fine.
There are two tests affected:
* test_scylla_sstable_script - this test uses early return when the
schema parameter is the one with counters and tablets are enabled.
This is still too eager because tablets are now always enabled. Also,
the early return make the fact that this test is disabled hidden.
So change the check to check whether tablets are used on the test
keyspace and use xfail instead of sneaky early return.
* test_scylla_sstable_dump_data - this test is blanket-disabled when run
with the tablets parameter. Even though only 1 out of 5 schemas tested
use counters. Remove the blanket xfail and only add it when test
keyspace uses tablets and the schema parameter is the one with
counters.
This makes dozens of test run again, restoring the test coverage lost
with the too eager use of xfail (and sneaky return).
Refs: #18180Closesscylladb/scylladb#21685
in 0dff187b7a, we dropped `InjectingHandler.log_message()`, but this
method was defined to override the default implementation provided by
`BaseHTTPRequestHandler.log_message()`. this change flooded the standard
output when testing `aws_error_injection_test` with `test.py` with
logging messages like:
```
127.0.0.1 - - [26/Nov/2024 17:27:34] "PUT /?Policy=0&Key=%2Ftest%2Ftestobject-large-817295 HTTP/1.1" 200
127.0.0.1 - - [26/Nov/2024 17:27:34] "PUT /?Policy=1&Key=%2Ftest%2Ftestobject-large-817306 HTTP/1.1" 200
```
this is unexpected.
in this change, we bring this method back, and additionally, we
format the logging message lazily.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21689
When users start an operation asynchronously with API, they are expected to check the operation's status. Hence, the status should be kept in task manager for reasonable time after the operation is done. The operations that are started internally usually don't need to stay in task manager for that long.
Add api_task_ttl that will be used for tasks started with API. By default it's 1 hour. The time for which non-API tasks stay in task manager isn't changed.
Fixes: #21499.
Refs: #21425.
No backport needed - previous versions may use task_ttl
Closesscylladb/scylladb#21505
* github.com:scylladb/scylladb:
test: add test to check user_task_ttl
tasks: api: move make_task method
docs: nodetool: update backup and restore commands docs
docs: update task manager docs
nodetool: add nodetool tasks user-ttl command
node_ops: use user task ttl for node ops virtual task
tasks: use user_task_ttl for tasks started by user
api: task_manager: add /task_manager/user_ttl to get and set user task ttl
tasks: add task_manager::task::is_user_task method
tasks: keep updateable_value of task_ttl in task manager
db: config: add user_task_ttl_seconds named value
* seastar a5432364...72c7ac57 (13):
> json2code: convert boost integer_range to std iota_view
> utils/program-options: selection_value: make get_candidate_names() public
> treewide: Trim trailing spaces
> build: Clean up `compile_option` after FindSanitizers module
> file: Convert file operations to use coroutines
> websocket: Remove unnecessary condition in frame parsing
> websocket: Fix logic when parsing header
> websocket: Avoid memory copy when full websocket frames are received
> websocket: Fix websocket frame parsing on partial packets
> sharded.hh: remove inline from templates
> reactor: fix reserve_io_control_blocks config name in error message
> reactor: next_waitpid_timeout: contants can be defined as constexpr
> reactor: coroutinize waitpid
Closesscylladb/scylladb#21688
Java tools are deprecated and slated for removal in the next ScyllaDB release.
Update the admin-tools docs and make sure all java tool documentation pages have a notice reflecting this fact.
Fixes: https://github.com/scylladb/scylladb/issues/21149
Should be backported to 6.2, so users of the latest stable version can see the notice.
Closesscylladb/scylladb#21522
* github.com:scylladb/scylladb:
docs: sstableloader.rst: add deprecation notice
docs: admin-tools: update deprecation notice for sstable{dump,metadata}
docs: tools_index.rst: remove deprecated sstablereset and sstablerepairedset tools
Stop taking snapshots of MVs and allow taking snapshot of individual tables, now one can take a snapshot of any base table, any view or index. Also add tests to cover new cases both boost test (using cc code) and pytest (using the API)
Also, update documentation to reflect the change
fixes: #21339fixes: #20760Closesscylladb/scylladb#21433
Mostly no functional changes here except in patch 3.
* 'gleb/cleanups' of github.com:scylladb/scylla-dev:
migration_manager: move migration manager verbs to the IDL
storage_proxy: remove unused function
storage_proxy: co-routinize handle_paxos_prepare
storage_proxy: co-routinise handle_paxos_prune
service: raft: no need to sync schema if the cluster is in raft topology mode
messaging_service: co-routinize messaging_service::stop_client
gossiper: rename apply_state_locally_without_listener_notification to apply_state_locally_in_shadow_round
Modernize the codebase by replacing Boost range adaptors with C++23 standard library views,
reducing external dependencies and leveraging modern C++ language features.
Key Changes:
- Replace `boost::adaptors::filtered` with `std::views::filter`
- Remove `#include <boost/range/adaptor/filtered.hpp>`
- Utilize standard library range views
Motivation:
- Reduce project's external dependency footprint
- Leverage standard library's range and view capabilities
- Improve long-term code maintainability
- Align with modern C++ best practices
Implementation Challenges and Considerations:
1. Range Conversion and Move Semantics
- `std::ranges::to` adaptor requires rvalue references
- Necessitated updates to variable and parameter constness
- Example: `cql3/restrictions/statement_restrictions.cc` modified to remove `const`
from `common` to enable efficient range conversion
2. Range Iteration and Mutation
- Range views may mutate internal state during iteration
- Cannot pass ranges by const reference in some scenarios
- Solution: Pass ranges by rvalue reference to explicitly indicate
state invalidation
Limitations:
- One instance of `boost::adaptors::filtered` temporarily preserved
due to lack of a C++23 alternative for `boost::join()`
- A comprehensive replacement will be addressed in a follow-up change
This change is part of our ongoing effort to modernize the codebase,
reducing external dependencies and adopting modern C++ practices.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21648
task_manager::module::make_task method template is used only for
test_task_impl. Move it to api/task_manager_test.cc and modify it
to be test_task_impl-specific.
Following combinations of error injections and cluster events
can cause #21534. Disable them for now because they break CI.
Closesscylladb/scylladb#21658
The `sstable validate-checksums` tool provides the validation result
via the `valid_checksums` key in its JSON response.
The name can be misleading as it refers to both the per-chunk checksums
and the digest (full checksum). We use the terms "digest" and
"full checksum" interchangeably.
Replace with the word "valid" to avoid confusion.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The previous patch extended `validate_checksums()` to perform checksum
validation even if the digest component is missing. Add a test case for
this scenario.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Currently, `validate_checksums()` expects the SSTable to have a digest
component and fails immediately otherwise. This is suboptimal since data
integrity verification could still be carried out partially via checksum
checking.
Lift this restriction by allowing the function to perform checksum
checking in any case, and treat digest checking as best effort. Add a
separate boolean flag in the response to indicate the presence or
absence of the digest component, so that the user can deduce if a valid
result involved digest checking or not.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`validate_checksums()` is used to check the checksums and digests of an
SSTable. Currently, the procedure is ad-hoc: the helper functions
`do_validate_[un]compressed()` loop over a raw stream, calculate the
actual checksums and digest, and compare against the expected ones.
In an effort to reduce code duplication, remove the custom procedure for
uncompressed SSTables and use a checksummed input stream instead. The
checksummed input stream offers the same functionality of checksum and
digest checking transparently. Also, check if the SSTable has checksums
before creating the input stream because `data_stream()` would return a
raw stream in this case.
Although the compressed input stream offers the same checksum and digest
checks, we need to stick with the existing procedure for compressed
SSTables. The reason is that `validate_checksums()` needs to examine the
whole data file, so any failed checksum checks must be tolerated. With
checksummed streams we support that via a user-provided graceful error
handler that just logs a message and updates the validation status.
However, with compressed streams we cannot customize the error handling
logic because they return decompressed data, but decompression may fail
if applied on corrupted data.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Expose the `error_handler` parameter from the checksummed input stream.
This is a callback function that the input stream calls if an invalid
checksum or digest is encountered.
The parameter is ignored if integrity checking is disabled. It is also
ignored in case of compressed SSTables, since the compressed input
streams do not support it.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Currently, the checksummed data source treats an invalid checksum or
digest as an unrecoverable error by throwing a
`malformed_sstable_exception`. This does not allow to use this data
source in places where it is required to resume after a failed checksum
(e.g., in `validate_checksums()`).
Make the error handling logic customizable via a callback function.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
For uncompressed SSTables, the expected number of chunks is the number
of checksums in the CRC component. The data file must contain the same
number of chunks. Otherwise, the SSTable should be considered as
corrupted.
Add a check in the checksummed data source to ensure that the data file
does not contain more chunks than expected. Run this check every time
the caller reads more data from the stream. The check will be triggered
when they attempt to read past the expected number of chunks and more
chunks are indeed available. This behavior is consistent with the
compressed data source and allows for partial reads to succeed.
This check will not be triggered if an SSTable has been corrupted by
appending new data, but the new data do not overflow the last chunk.
Since the SSTable metadata only record the expected number of chunks, we
cannot know the exact expected file size at a byte-level. However, this
kind of corruption will be detected by the checksum check, and by the
digest check if enabled.
In fact, the checksum check would suffice for all kinds of corruption
due to appended data except for one case: when the pre-corruption data
file was aligned at the chunk boundary, i.e., the last chunk was full.
This patch closes this gap.
Finally, note that, as a side-effect, this patch fixes a bug where we
would do an out-of-bounds read on the checksum array.
This patch is part of incorporating the functionality of
`do_validate_uncompressed()` into the checksummed data source.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Uncompressed SSTables may have a last chunk that is smaller than the
chunk size. The condition for premature EOF is when more chunks are
expected when such a chunk is encountered. The expected number of chunks
is the number of checksums in the CRC component.
A premature EOF can happen if the data file has been truncated.
An edge case is when the truncation happened at exactly the chunk
boundary and before the SSTable was loaded. In this case, this check
will not be triggered because the early return statement of `get()`
will evaluate as true (`_pos` will match the `_end_pos`, which is the
actual file size). But it will be caught by the digest check.
This patch is part of incorporating the functionality of
`do_validate_uncompressed()` into the checksummed data source.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add user_task_ttl_seconds config option and keep the value in task manager.
In the following patches tasks started by user will be kept in task
manager for user_task_ttl_seconds after they are finished.
Reduce dependency load by standardizing on std::ranges. This is a little
involved since a we use a custom iterator.
Code cleanup; no backport.
Closesscylladb/scylladb#21421
* github.com:scylladb/scylladb:
locator: token_metadata: switch from boost ranges to std ranges
locator: token_metadata: make iterator support std::input_iterator concept
locator: tokens_metadata: move tokens_iterator to namespace scope
An SSTable can be corrupted by appending random data to it.
`validate_checksums()` should be able to identify such SSTables as
invalid.
Cover this with a test case.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The tests for `validate_checksums()` already cover the case of a
truncated SSTable. However, the test performs the truncation after the
SSTable has been loaded, which means that the SSTable object has cached
the old file size by the time we validate its checksums. This is a valid
case, but not the most common one.
Add a new test that loads the SSTable after the truncation. Do not use
the same SSTable as for the other tests, since this has been loaded
already.
Additionally, let both tests check SSTables with different types of
truncations: minor truncations affecting only the last chunk, and major
truncations spanning across multiple chunks.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In this PR, we improve various aspects of the test:
* increase obtained information whenever any test case fails,
* split test cases,
* elaborate on the semantics of generating view updates
and what exactly we check and why.
Backport: not needed, this is an enhancement.
Closesscylladb/scylladb#21579
* github.com:scylladb/scylladb:
test/boost/view_schema_test: Improve comments in test_view_update_generating_writetime
test/boost/view_schema_test.cc: Improve checks in test_view_update_generating_writetime
test/boost/view_schema_test.cc: Split test cases in test_view_update_generating_writetime
The non local strategy system keyspaces usually contain very litte data.
All the tables within them have to be repaired for all the token ranges,
which could be large in clusters with a large number of nodes. In multiple
DC setup, the repair in RBNO is dominated by the network latency. As a
result, it takes a long time to repair those tables even if they are
almost empty.
To speed up the RBNO bootstrap, especially for starting empty clusters,
this patch enables small table optimization for RBNO for system tables.
We could enable it for small user tables as a follow up.
Tests:
1) A 5ms latency is added to simulate cross dc network delay, 256 tokens
per node, 10 nodes:
- Before
topology_custom dev topology_custom.test_boot_time.1 1287.06s
- After
topology_custom dev topology_custom.test_boot_time.1 12.48s
The test shows 100X boot time improvement
2) A SCT test to bootstrap 3 DCs, 3 nodes in each DC.
- Before
Time to bootstrap = 1h23m
- After
Time to bootstrap = 13m
The test shows 6X bootstrap time improvement
Fixes#19131
New feature. No backport is needed.
Closesscylladb/scylladb#21207
* github.com:scylladb/scylladb:
repair: Enable small table optimization for RBNO bootstrap and decommission
repair: Move flush_rows after repair_meta class
I reread the "ScyllaDB Alternator for DynamoDB users" document
(alternator/compatibility.md) and improved various places that I thought
needed improvement.
Two of the more significant changes is moving the not-really-important
"Scan ordering" section much lower in the document and explaining it
better, and improving the "provisioning" section to focus on the available
and missing functionality, and not on minor API details.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21605
With C++23, we have ranges available. Migrate the code in `readers/` to use ranges, to reduce our dependency on boost (and modernize the code a bit).
Improvement, no backport required
Closesscylladb/scylladb#21643
* github.com:scylladb/scylladb:
readers/mutation_reader: migrate to std::ranges
readers/multishard: migrate to std::ranges::{push,pop}_heap()
readers/combined: migrate to std::ranges::subrange<>
readers/combined: migrate to std::ranges::{push,pop}_heap()
We run topology_random_failures in debug mode only and sometimes Scylla
is too slow in this mode. Increase timeout for Scylla startup from
30s to 180s to reduce flakiness.
Fixes#21101Closesscylladb/scylladb#21659
The previous commit (b3ebbf35e2) transformed `make_sstables_available()`
into a coroutine but left behind incorrectly indented statements
from a nested lambda. This commit restores proper indentation.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21660
The non local strategy system keyspaces usually contain very litte data.
All the tables within them have to be repaired for all the token ranges,
which could be large in clusters with a large number of nodes. In multiple
DC setup, the repair in RBNO is dominated by the network latency. As a
result, it takes a long time to repair those tables even if they are
almost empty.
To speed up the RBNO bootstrap, especially for starting empty clusters,
this patch enables small table optimization for RBNO for system tables.
We could enable it for small user tables as a follow up.
Tests:
1) A 5ms latency is added to simulate cross dc network delay, 256 tokens
per node, 10 nodes:
- Before
topology_custom dev topology_custom.test_boot_time.1 1287.06s
- After
topology_custom dev topology_custom.test_boot_time.1 12.48s
The test shows 100X boot time improvement
2) A SCT test to bootstrap 3 DCs, 3 nodes in each DC.
- Before
Time to bootstrap = 1h23m
- After
Time to bootstrap = 13m
The test shows 6X bootstrap time improvement
Fixes#19131
In this commit, we elaborate on the semantics of generating view updates
for each case the test goes through so that the reader less familiar
with the logic has an easier time understanding it.
We modify the checks in the test to obtain full information whenever
a failure happens. Before this change, we compared the number of view
updates one-by-one. As a result, when the first check failed, we didn't
learn anything about the other two. Now we always compare them all at
once.
A negative impact of this commit is that if one of the lambdas throws
an exception, we don't learn ANYTHING. However, a lambda throwing
an exception is a more appalling problem than the comparison failing,
and we DO learn about it in such a situation; so we accept that cost.
We split some of the test cases so it's clearer what's going on in the
test. Also, if a bug happens in the future, it should be easier to
reason about it when it corresponds to exactly one CQL statement
instead of possibly two.
Read and Write Consumed Capacity units are an abstract way of measuring Alternator actions. In general, they correspond to the read or write data.
In the long run, the RCU/WCU adds a way of charging an operation and limiting usage.
This series addresses two issues: consume capacity request API and metering.
The Alternator (and DynmoDB) API has an optional parameter allowing users to check the number of units an operation consumes. When a user adds that parameter, the response will contain the number of units used for the operation.
This series adds the consume capacity support to the get_item and put_item, adds a metric to collect the overall RCU and WCU used, and adds a test for the new functionality.
Follow-up PRs will add support for more operations and GSI.
Replaces #19811
Partially implement: #5027Closesscylladb/scylladb#21543
* github.com:scylladb/scylladb:
alternator/test_metrics: Add tests for table consumption units
test_returnconsumedcapacity.py: Add putItem tests
Alternator: add WCU support
Add test/alternator/test_returnconsumedcapacity.py
alternator/executor: Add consume capacity for get_item
alsternator/stats: Add rcu and wcu metrics to stats
alternator/executor.hh: white-space cleanup
Add the consume_capacity helper class
The semantics of Scylla's materialized views may vary depending on how their
primary keys correspond to the base table's one. One of the differences is
how we handle writes to columns in the base table that are not selected by
a view:
* Case 1: The view's PK is a permutation of the base table's PK:
Since the view's primary key cannot be changed in an update, a row in
the view remains alive as long as the corresponding row in the base table
is alive.
The tricky part comes when the base table has columns that are NOT selected
by the view. CQL3 used to not allow for defining a table that didn't have
any other columns besides its primary key. Also, when inserting a row into
a table, it was mandatory to provide at least one value aside from the
primary key. At some point it changed [1] and the implementation of the
solution relied on the notion of the row marker.
Putting the details aside, consider the following scenario:
(i) the base table has a primary key consisting of columns
c_1, ..., c_k, and it has regular columns rc_1, ..., rc_n,
(ii) the primary key of an MV defined on that table consists of
a permutation of c_1, ..., c_k. The MV doesn't select at least
one of the regular columns of the base table. Without loss of
generality, let that unselected column be rc_1.
(iii) the base table has a row R whose only non-null value is the one
in the regular column rc_1.
Now, what will R correspond to in the MV? The base table doesn't have a row
marker, but all of its regular columns in the MV will be NULLs. That's NOT
allowed.
To solve that problem, all unselected columns have corresponding virtual
columns in the MV; the only information they provide is whether there is
a value in the base table or not. This way, the MV knows if a row is still
alive or not.
For that reason, we send view updates to virtual columns in the following
cases:
(i) the value in the column changes from NULL to a value, i.e. it's
created,
(ii) the value in the column exists, but its TTL has been updated.
* Case 2: The view's PK has one more column that the base table's one:
Since the primary key of the view has a regular column C from the base
table, it is guaranteed that if there's a row in the MV, the corresponding
row in the base table can remain alive: since C is part of the view's PK,
it must have a value, so the row in the base table has a value in C too.
The problem with virtual columns from the previous case doesn't manifest
in this one. The liveness of the cell in C determines the liveness of
the whole row in the view.
The semantics gets more complex, but the conclusion is this: in case 1,
virtual columns exist and we may need to generate view updates for them,
while in case 2 virtual columns do NOT exist and so we don't generate
view updates for them.
What changes in this patch is we adjust the code to it. If a view has
a regular column from the base table as part of its primary key, we
no longer emit view updates when we change a column unselected by that
view. It is purely an OPTIMIZATION change.
[1]: https://issues.apache.org/jira/browse/CASSANDRA-4361Fixesscylladb/scylladb#21652Closesscylladb/scylladb#21653
Schema of system tables is defined statically and table_schema_version needs to be explicitly set in code like this:
```
builder.with_version(system_keyspace::generate_schema_version(table_id, version_offset));
```
Whenever schema is changed, the schema version needs to change, otherwise we hit undefined behavior when trying to interpret mutation data created with the old schema using the new schema.
It's not obvious that one needs to do that and developers often forget to do that. There were several instances of mistakes of omission, some caught during review, some not, e.g.: 31ea74b96e.
This patch changes definitions to call the new `schema_builder::with_hash_version()`, which will make the schema builder compute version from schema definition so that changes of the schema will automatically change the version. This way we no longer rely on the developer to remember to bump the version offset.
All nodes should arrive at the same version, which is verified by existing `test_group0_schema_versioning` and a new unit test: `test_system_schema_version_is_stable`.
Closesscylladb/scylladb#21602
* github.com:scylladb/scylladb:
system_tables: Compute schema version automatically
schema_builder: Introduce with_hash_version()
schema: Store raw_view_info in schema::raw_schema
schema: Remove dead comment
hashing: Add hasher for unordered_map
hashing: Add hasher for unique_ptr
hashing: Add hasher for double
[avi: add missing include <memory> to hashing.hh]
This patch introduces the skip_when_empty flag to all CQL counters that
previously lacked this setting.
The skip_when_empty flag is a metric optimization that prevents
reporting on counters that have never been used. Once a counter has been
used (i.e., it holds a positive value), it will continue to be reported
consistently from that point onward.
Fixes#21046
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closesscylladb/scylladb#21565
Schema syncing during group0 joining is needed during upgrade from a
cluster without raft to one that will be managed by raft, but if
topology cmd is enabled by cluster during group0 join it means that the
cluster is already in the raft mode and the schema sync can be safely
skipped.
instead of making a copy of the warnings vector, make the warnings a non const in prepared_statement
and move the warnings vector to execute_maybe_with_guard
Closesscylladb/scylladb#20361Closesscylladb/scylladb#21083
The java tools (including sstableloader) are deprecated and slated for
removal in the next ScyllaDB release. Add a notice about this to the
sstableloader page.
These two tools already have a deprecation notice, since ScyllaDB 5.4.
Now we have a target release for the actual removal of these tools, so
update the deprecation notice to reflect that.
Theset tools were unused and one of them doesn't even work, as ScyllaDB
doesn't have incremental repair implemented.
We are deprecating the java tools in the next release so drop these from
the list. Since they don't even have a page of their own, they don't get
a deprecation notice like the other tools in this PR.
More gossiper cleanups that accumulated since the previous one.
* 'gleb/more-gossip-cleanup-v2' of github.com:scylladb/scylla-dev:
gossiper: replace milliseconds with seconds where appropriate
gossiper: simplify failure_detector_loop loop a bit
gossiper: use fmt library to format time
gossiper: drop on_success callback from mutate_live_and_unreachable_endpoints
gossiper: remove code duplication between shadow round and regular path when state is applied
gossiper: remove remnants of old shadow round
gossiper: fix indentation after the last patch
gossiper: co-routinize do_shadow_round
Remove the `--python` option which was originally added in 780d9a26b2 to support
CentOS's non-standard python3 path (`/usr/bin/python3.4`).
Since we now:
- Build using a Fedora-based container with standard python3 path
- Use properly configured shebangs in build scripts
- Set correct executable permissions on Python scripts
This change:
1. Removes the `--python` command line option
2. Updates build rules to execute Python scripts directly instead of via interpreter
This simplifies the build system and reduces differences between CMake and
configure.py-generated rules.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21607
std::ranges::{push,pop}_heap() will only generate default comparator if
the compared types are fully ordered. So we need to pass std::less<>
explicitely as comparator for the code to compile.
As reported by @Deexie, during the process of opening backport PRs in https://github.com/scylladb/scylladb/pull/21616, No invite emails were sent, causing a lack of permissions for the backport PR branch
The check if `has_in_collaborators(pr.user.login)` was pointing to `scylladb/scylladb` instead of `scylladbbot/scylladb`, fixing it
I also moved the collaborator check to an early stage, before trying to open a backport PR
Closesscylladb/scylladb#21645
This adds a new tablet migration kind: repair. It allows tablet repair
scheduler to use this migration kind to schedule repair jobs.
The current repair scheduler implementation does the following:
- A tablet is picked to be repaired when is requested by user
- The tablet repair can be scheduled along with tablet migration and
rebuild. It runs in the tablet_migration track.
- Repair jobs are scheduled in a smart way so that at any point in time,
there are no more than configured jobs per shard, which is similar to
scylla manager's control.
New feature. No backport is needed.
Closesscylladb/scylladb#21088
* github.com:scylladb/scylladb:
test: Add tests for tablet repair scheduler
repair: Add restful API for tablet repair
repair: Add tablet repair scheduler internal API support
docs: Update system_keyspace.md for tablet repair related info
docs: Add docs for tablet repair migration
repair: Add core tablet repair scheduler support
messaging_service: Introduce TABLET_REPAIR verb
tablet_allocator: Introduce stream_weight for tablet_migration_streaming_info
network_topology_strategy: Preserve fields of task_info in reallocate_tablets
From boost::iterator_range<>. One return in maybe_produce_batch() had to
be adjusted because it used a strange initialization of
boost::iterator_range<>, which should not even had compiled.
Adding tests to verify the RCU and WCU metrics.
A new helper function check_increases_metric_exact check that a given
metrics increased by a given number.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds testing for putItem consume capacity.
There is an additional test for number support. Numbers are encoded
differently with alternator and dynamoDB, the test adds some flexibility
in the result so it would pass both DynamoDB and Alternator.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
now that we are allowed to use C++23. we now have the luxury of using `std::ranges::find_if`.
in this change, we:
- replace `boost::find_if` with `std::ranges::find_if`
- remove all `#include <boost/range/algorithm/find_if.hpp>`
to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#21495
* github.com:scylladb/scylladb:
treewide: replace boost::find_if with std::ranges::find_if
counters: replace boost::find_if with std::ranges::find_if
combine.hh: use std::iter_const_reference_t when appropriate
Currently, PER PARTITION LIMIT is not implemented for aggregates and queries can result in more rows than expected from the same partition.
Instrument the result_set_builder class so that it can enforce PER PARTITION LIMIT for aggregate queries, specifically:
- add per_partition_limit to the result_set_builder
- expose the number of input rows in the selector
result_set_builder gets two new functions handling partition start and end:
- accept_partition_end for notifying that a partition has been finished. This is also called when a page ends, so we cannot simply flush here, as a naive implementation could do.
- accept_new_partition, where we flush_selectors() if it's indeed a new partition (and not a continuation of the previous) and the query has a grouping: we don't want to flush on new partition in a query like SELECT COUNT(*) FROM foo;
Fixes#5363Closesscylladb/scylladb#21125
* github.com:scylladb/scylladb:
test: enable PER PARTIION LIMIT + GROUP BY tests
cql3: respect PER PARTITION LIMIT for aggregates
cql3: selection: count input rows in the selector
cql3: selection: pass per partition limit to the result_set_builder
cql3: show different messages for LIMIT and PER PARTITION LIMIT in get_limit
Currently, task_manager_module::abort_all_repairs marks top-level repairs as aborted (but does not abort them) and aborts all existing shard tasks.
A running repair checks whether its id isn't contained in _aborted_pending_repairs and then proceeds to create shard tasks. If abort_all_repairs is executed after _aborted_pending_repairs is checked but before shard tasks are created, then those new tasks won't be aborted. The issue is the most severe for tablet_repair_task_impl that checks the _aborted_pending_repairs content from different shards, that do not see the top-level task. Hence the repair isn't stopped but it creates shard repair tasks on all shards but the one that initialized repair.
Abort top-level tasks in abort_all_repairs. Fix the shard on which the task abort is checked.
Fixes: #21612.
Needs backport to 6.1 and 6.2 as they contain the bug.
Closesscylladb/scylladb#21616
* github.com:scylladb/scylladb:
test: add test to check if repair is properly aborted
repair: add shard param to task_manager_module::is_aborted
repair: use task abort source to abort repair
repair: drop _aborted_pending_repairs and utilize tasks abort mechanism
repair: fix task_manager_module::abort_all_repairs
Those internal APIs allow to add / del a tablet repair request and
config the tablet repair scheduler.
It can be used by task manager or plain restful api.
This adds a new tablet migration kind: repair. It allows tablet repair
scheduler to use this migration kind to schedule repair jobs.
The current repair scheduler implementation does the following:
- A tablet is picked to be repaired when the time since last repair is
bigger than a threshold (auto repair mode) or it is requested by user
(manual repair mode)
- The tablet repair can be scheduled along with tablet migration and
rebuild. It runs in the tablet_migration track.
- Repair jobs are scheduled in a smart way so that at any point in time,
there are no more than configured jobs per shard, which is similar to
scylla manager's control.
In this patch, both the manual repair and the auto repair are not
enabled yet.
This patch adds functionality to track Write Capacity Units (WCU).
Currently for the put_item operation.
This enhancement allows for standardized measurement of write
operations, aligning with DynamoDB-like metrics.
Additionally, the WCU value is now optionally included in the response to provide
immediate feedback on the write capacity usage.
The implementation adds a consumed_capacity_counter member to
rmw_operation, this will allow to add WCU functionality to update_item
and delete_item
This patch adds testing for the consumedCapacity header.
It's currently only test get_item
The test works with both AWS and alternator.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds functionality to track Read Capacity Units (RCU) for the
get_item operation. This enhancement allows for standardized measurement
of read operations, aligning with DynamoDB-like metrics.
Additionally, the RCU value can now be included in the response to
provide immediate feedback on the read capacity usage.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Introduced `rcu` (Read Capacity Units) and `wcu` (Write Capacity Units)
metrics to the `stats` object for enhanced capacity tracking.
`rcu` and `wcu` provide a simplified way of measuring reads and writes,
respectively, by representing capacity usage in standardized units.
This patch adds these metrics to the existing alternator stats, enabling
monitoring of the total consumed units.
Alternator API should support returning WCU and RCU when requested.
The consumed capacity helper class serves multiple purposes:
1. Break the logic of calculating the RCU and WCU from the main code.
2. Add a helper class consumed_capacity_counter that can accumulate bytes.
3. Optionally update counters for RCU and WCU that will be used by the
metric layer.
4. Update the response with the consumed units if needed.
The consumed_capacity_counter is a base class with two implementations:
A read and write implmenentation.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Before these changes, we didn't wait for the materialized views to
finish building before writing to the base table. That led to generating
an additional view update, which, in turn, led to test failures.
The scenario corresponding to the summary above looked like this:
1. The test creates an empty table and MVs on it.
2. The view builder starts, but it doesn't finish immediately.
3. The test performs mutations to the base table. Since the views
already exist, view updates are generated.
4. Finally, the view builder finishes. It notices that the base
table has a row, so it generates a view update for it because
it doesn't notice that we already have data in the view.
We solve it by explicitly waiting for both views to finish building
and only then start writing to the base table.
Additionally, we also fix a lifetime issue of the row the test revolves
around, further stabilizing CI.
Fixes https://github.com/scylladb/scylladb/issues/20889
Backport: These changes have no semantic effect on the codebase,
but they stabilize CI, so we want to backport them to the maintained
versions of Scylla.
Closesscylladb/scylladb#21632
* github.com:scylladb/scylladb:
test/boost/view_schema_test.cc: Increase TTL in test_view_update_generating_writetime
test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime
The auxiliary function `eventually()` (defined in `test/lib/eventually.hh`)
tries to execute a passed function. If it throws, `eventually()` sleeps
for `2^#previous_attempts` milliseconds and tries to perform it again.
The default limit of attempts is 17.
In `test_view_update_generating_writetime`, right before the last test
case, we perform:
```cql
UPDATE t USING TTL 10 AND TIMESTAMP 8 SET g=40 WHERE k=1 AND c=1;
```
The test case itself executes:
```cql
SELECT WRITETIME(g) FROM t;
```
and asserts that the result of the query is equal to 8, i.e. it
corresponds to the timestamp of the last write to the table `t`.
However, if the test case keeps failing, then during its 14th attempt
(so affter sleeping for at least `2^14 - 1` milliseconds, which amounts
to about 16 seconds), we'll observe the following error:
```
[Exception] - std::runtime_error: Expected row not found: [0000000000000008] not in {result_message::rows {row: null}}
```
The reason behind it is the specified TTL is too short. 10 seconds will
have already passed before the 14th attempt, so the value in the column
`g` will be `NULL` again. In particular, the `WRITETIME(g)` will no
longer be equal to `8`.
To solve that issue, we change the TTL in the CQL statement to 300.
The time spent on 17 loops of `eventually()` amounts to about
`2^18 - 1` milliseconds, which is about 263 seconds. That's why
setting the TTL to 300 seconds should be enough to prevent the error
from occurring.
Before these changes, we didn't wait for the materialized views to
finish building before writing to the base table. That led to generating
an additional view update, which, in turn, led to test failures.
The scenario corresponding to the summary above looked like this:
1. The test creates an empty table and MVs on it.
2. The view builder starts, but it doesn't finish immediately.
3. The test performs mutations to the base table. Since the views
already exist, view updates are generated.
4. Finally, the view builder finishes. It notices that the base
table has a row, so it generates a view update for it because
it doesn't notice that we already have data in the view.
We solve it by explicitly waiting for both views to finish building
and only then start writing to the base table.
Fixesscylladb/scylladb#20889
Alternator's "/localnodes" HTTP requests is supposed to return the list
of nodes in the local DC to which the user can send requests.
Before commit bac7c33313 we used the
gossiper is_alive() method to determine if a node should be returned.
That commit changed the check to is_normal() - because a node can be
alive but in non-normal (e.g., joining) state and not ready for
requests.
However, it turns out that checking is_normal() is not enough, because
if node is stopped abruptly, other nodes will still consider it "normal",
but down (this is so-called "DN" state). So we need to check **both**
is_alive() and is_normal().
This patch also adds a test reproducing this case, where a node is
shut down abruptly. Before this patch, the test failed ("/localnodes"
continued to return the dead node), and after it it passes.
Fixes#21538
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21540
Adding `Fixes` validation to a PR when backport labels were added. When the auto backport process triggers (after promotion), we will ensure each PR with backport/x.y label also has in the PR body a `Fixes` reference to an issue
Fixes: https://github.com/scylladb/scylladb/issues/20021Closesscylladb/scylladb#21563
Change e3e8a94c9a changed
the semantics of the enable_tablets config option,
but updating that in the option documentation in scylla.yaml
was missed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#21614
This change was created in the same spirit of f8221b960f.
The S3ProxyServer (introduced in 8919e0abab) currently prints its
status directly to stdout, which can be distracting when reviewing test
results. For example:
```console
$ ./test.py --mode release object_store/test_backup::test_simple_backup_and_restore
Found 1 tests.
Setting minio proxy random seed to 1731924995
Starting S3 proxy server on ('127.193.179.2', 9002)
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[1/1] object_store release [ PASS ] object_store.test_backup.1
Stopping S3 proxy server
------------------------------------------------------------------------------
CPU utilization: 3.1%
```
Move these messages to use proper logging to give developers more control
over their visibility:
- Make logger parameter mandatory in S3ProxyServer constructor
- Route "Stopping S3 proxy" message through the provided logger
- Add --log-level option to the standalone proxy server launcher
The message is now hidden:
```console
$ ./test.py --mode release object_store/test_backup::test_simple_backup_and_restore
Found 1 tests.
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[1/1] object_store release [ PASS ] object_store.test_backup.1
------------------------------------------------------------------------------
CPU utilization: 4.1%
```
---
this change improves the developer experience, hence no need to backport.
Closesscylladb/scylladb#21610
* github.com:scylladb/scylladb:
test: route S3 Proxy server messages through logger
test: s3_proxy: remove unused method
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::find_if`.
in this change, we:
- replace `boost::find_if` with `std::ranges::find_if`
- remove all `#include <boost/range/algorithm/find_if.hpp>`
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
std::ranges allows us to create a range from a pair of iterators.
but the iterator has to fulfill the concept of `std::semiregular`.
in order to reduce the header dependency on boost, we need to
make `basic_counter_cell_view::shard_iterator` to support
`std::semiregular`.
in this change:
- define a default constructor for
`basic_counter_cell_view::shard_iterator`, so that the iterator
satisfies the constraints of `std::semiregular`, as required by
C++20's forward_iterator concept. please note, despite that
the standard requires the iterator to be `std::semiregular`, but
the iterator created by default constructor is not evaluated in
production. sometimes, the standard algorithms just need to
store/create itermediate iterators or to represent a "singular"
state for iterator. a use case is an empty container.
- change `basic_counter_cell_view::shard_iterator::reference` so
its dereference returns a rvalue instead of a reference. because
per C++20 standard, the dereference of a forward_iterator should
be stable, but we were returning a reference / pointer referencing
a member variable of the iterator. so once the iterator is destructed,
the returned reference / pointer would be invalidated. so we have to
return a value to fulfill the requiremend of forward_iterator. this
change also fulfills the requirement of `same_as<iter_reference_t<It>,
iter_reference_t<const It>>`, which a part of the
`indirectly_readable` requirement.
- let `basic_counter_cell_view::shards()` return a subrange
- let `basic_counter_shard_view::swap_value_and_clock()` accepts
a plain value instead of a reference. because the dereference of
the iterator does not return a reference anymore. and the returned
type is a lightweighted "view", so the performance penality is
negligible.
- use ranges libraries when appropriate in this header.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we assumed that the dereference types of
the given `InputIterator1` and `InputIterator2` are always
references. but this does not hold if the `operator*` returns
a rvalue, as in the C++20 standard, unlike the LegacyForwardIterator
requirement, `std::forward_iterator` does not requires
dereference to return a reference. so we should not assume this,
if we want to use `combine()` with iterators whose dereference
return a, for instance, rvalue.
in this change, we use `std::iter_const_reference_t` instead. this
type is deduced from the behavior of the iterator instead of hardwire
it to a reference type. this allows us to use a C++20 forward_iterator
with this generic function.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The stream_weight for repair migration is set to 2, because it requires
more work than just moving the tablet around. The stream_weight for all
other migrations are set to 1.
For historic reasons, we have (in bytes.hh) a type sstring_view which is an alias for std::string_view - since the same standard type can hold a pointer into both a seastar::sstring and std::string.
This alias in unnecessary and misleading to new developers, who might be misled to believe it is assume it is somehow different from std::string_view - when it isn't.
This series removes all uses of sstring_view (changing them to use std::string_view), and in the last patch removes the alias itself. A few functions whose name referred to "sstring" but take a std::string_view were renamed.
The patches are fairly mechanical and trivial, with no functional changes intended. To ease the review the series was split to a few smaller patches that modify specific areas of the code.
Fixes#4062.
Closesscylladb/scylladb#21617
* github.com:scylladb/scylladb:
bytes: remove unused alias sstring_view
change remaining sstring_view to std::string_view
test: change sstring_view to std::string_view
cql3: change sstring_view to std::string_view
alternator: change sstring_view to std::string_view
type: change from_sstring() to from_string_view()
cross-tree: change to_sstring_view() to to_string_view()
Wean the mutation code (at least the headers) from boost ranges to std ranges,
in order to reduce the dependency load.
Cleanup, so no backport.
Closesscylladb/scylladb#21601
* github.com:scylladb/scylladb:
partition_snapshot_row_cursor.hh: switch from boost ranges to std ranges
mutation: mutation_partition_v2.hh: switch from boost ranges to std ranges
mutation: mutation_partition.hh: switch from boost ranges to std ranges
partition_snapshot_reader.hh: drop unused include boost/range/algorithm/heap_algorithm.hpp
This change adds support for PER PARTITION LIMIT for aggregate queries.
result_set_builder gets two new functions handling partition start and
end:
- accept_partition_end for notifying that a partition has been finished.
This is also called when a page ends, so we cannot simply flush here,
as a naive implementation could do.
- accept_new_partition, where we flush_selectors() if it's indeed a new
partition (and not a continuation of the previous) and the query has a
grouping: we don't want to flush on new partition in a query like
SELECT COUNT(*) FROM foo;
select_statement::get_limit is used to evaluate the LIMIT value for both
LIMIT and PER PARTITION LIMIT. This change fixes the error message for
incorrect values passed by the user.
Currently, task_manager_module::is_aborted checks whether a task
with given id was aborted on this shard.
In tablet_repair_task_impl::run, is_aborted method is called on all
shards to check if the parent task was aborted. However, even
for aborted parent, is_aborted will return true only on owner shard
of the parent.
Pass shard param to task_manager_module::is_aborted that indicates
which shard to check.
Currently, task_manager_module::abort_all_repairs marks top-level
repairs as aborted (but does not abort them) and aborts all shard
tasks. If after that a top-level repair creates a shard task,
the new shard repair won't be aborted.
Abort top-level repair tasks in abort_all_repairs. They will abort
their children and newly created shard tasks will be immediately
aborted.
Our "sstring_view" was an historic alias for the standard std::string_view.
All its uses were removed in the previous patches, so we can now finally
remove this unused alias.
Refs #4062.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Our "sstring_view" is an historic alias for the standard std::string_view.
The patch changes the last remaining random uses of this old alias across
our source directory to the standard type name.
After this patch, there are no more uses of the "sstring_view" alias.
It will be removed in the following patch.
Refs #4062.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Our "sstring_view" is an historic alias for the standard std::string_view.
The test/ directory used this old alias in a few of random places, let's
change them to use the standard type name.
Refs #4062.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Our "sstring_view" is an historic alias for the standard std::string_view.
The cql3/ directory used this old alias in a few of random places, let's
change them to use the standard type name.
Refs #4062.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Our "sstring_view" is an historic alias for the standard std::string_view.
Alternator only used this alias in a couple of random names, let's change
them to the standard type name.
Refs #4062.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All CQL type implementations have a from_sstring(sstring_view) method.
The "sstring_view" type is just an historic alias for std::string_view,
so this patch switches to use the standard type as suggested in #4062,
and also renames these functions from_string_view() to emphesize they can
take any string view, and not necessarily a "sstring" as their old name
suggested.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
For historic reasons, we have (in bytes.hh) a type sstring_view which
is an alias for std::string_view - since the same standard type can hold
a pointer into both a seastar::sstring and std::string.
This alias in unnecessary and misleading to new developers (who might
assume it is somehow different from std::string_view). This patch doesn't
yet remove all occurances of sstring_view (the request in #4062), but
begins to do it by renaming one commonly-used function, to_sstring_view(bytes)
to to_string_view() and of course changes all its uses to the new name.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This change was created in the same spirit of f8221b960f.
The S3ProxyServer (introduced in 8919e0abab) currently prints its
status directly to stdout, which can be distracting when reviewing test
results. For example:
```console
$ ./test.py --mode release object_store/test_backup::test_simple_backup_and_restore
Found 1 tests.
Setting minio proxy random seed to 1731924995
Starting S3 proxy server on ('127.193.179.2', 9002)
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[1/1] object_store release [ PASS ] object_store.test_backup.1
Stopping S3 proxy server
------------------------------------------------------------------------------
CPU utilization: 3.1%
```
Move these messages to use proper logging to give developers more control
over their visibility:
- Make logger parameter mandatory in S3ProxyServer constructor
- Route "Stopping S3 proxy" message through the provided logger
- Add --log-level option to the standalone proxy server launcher
The message is now hidden:
```console
$ ./test.py --mode release object_store/test_backup::test_simple_backup_and_restore
Found 1 tests.
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[1/1] object_store release [ PASS ] object_store.test_backup.1
------------------------------------------------------------------------------
CPU utilization: 4.1%
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
neither `InjectingHandler.log_error`, nor `InjectingHandler.log_message`
is used. so let's drop them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tablet_repair_task_impl keeps a vector of tablet_repair_task_meta,
each of which keeps an effective_replication_map_ptr. So, after
the task completes, the token metadata version will not change for
task_ttl seconds.
Implement tablet_repair_task_impl::release_resources method that clears
tablet_repair_task_meta vector when the task finishes.
Set task_ttl to 1h in test_tablet_repair to check whether the test
won't time out.
Fixes: #21503.
Closesscylladb/scylladb#21504
This PR enables compaction tasks to verify the integrity of the input data through checksum and digest checks. The mechanism for integrity checking was introduced in previous PRs (#20207, #20720) as a built-in functionality of the input streams. This PR integrates this mechanism with compaction. The change applies to all compaction types and covers both compressed and uncompressed SSTables adhering to the 3.x format. If a compaction task reads only part of an SSTable, then only the per-chunk checksums are verified, not the digest.
The PR consists of:
* Changes to mx readers to support integrity checking. The kl readers, considered as compatibility-only, were left unchanged. Also, integrity checking on single-partition reversed reads (`data_consume_reversed_partition()`) remains unsupported by mx readers as this is not used in compaction.
* Changes to `sstable` and `sstable_set` APIs to allow toggling integrity checks for mx readers.
* Activation of integrity checking for all compaction types.
* Tests for all compaction types with corrupted SSTables.
Integrity checks come at a cost. For uncompressed SSTables, the cost is the loading of the CRC and Digest components from disk, and the calculation of checksums and digest from the actual data. For compressed SSTables, checksums are stored in-place and they are being checked already on all reads, so the only extra cost is the loading and calculation of the digest. The measurements show a ~5% regression in compaction performance for uncompressed SSTables, and a negligible regression for compressed SSTables.
Command: `perf-sstable --smp=1 --cpuset=1 --poll-mode --mode=compaction --iterations=1000 --partitions 10000 --sstables=1 --key_size=4096 --num_columns=15 --column_size={32, 1024, 3500, 7000, 14500}`
Uncompressed SSTables:
```
+--------------+-----------------------+----------------------+------------+
| SSTable Size | No Integrity (p/sec) | Integrity (p/sec) | Regression |
+--------------+-----------------------+----------------------+------------+
| 50 MiB | 65175.59 +- 80.82 | 61814.63 +- 72.88 | 5.16% |
| 200 MiB | 41795.10 +- 60.39 | 39686.28 +- 45.05 | 5.05% |
| 500 MiB | 21087.41 +- 30.72 | 20092.93 +- 25.05 | 4.72% |
| 1 GiB | 12781.64 +- 21.77 | 12233.94 +- 21.71 | 4.29% |
| 2 GiB | 6629.99 +- 9.40 | 6377.13 +- 8.28 | 3.81% |
+--------------+-----------------------+----------------------+------------+
```
Compressed SSTables:
```
+--------------+-----------------------+----------------------+------------+
| SSTable Size | No Integrity (p/sec) | Integrity (p/sec) | Regression |
+--------------+-----------------------+----------------------+------------+
| 50 MiB | 53975.05 +- 63.18 | 53825.93 +- 62.28 | 0.28% |
| 200 MiB | 28687.94 +- 26.58 | 28689.41 +- 26.91 | 0% |
| 500 MiB | 13865.35 +- 15.50 | 13790.41 +- 14.88 | 0.54% |
| 1 GiB | 7858.10 +- 7.71 | 7829.75 +- 9.66 | 0.36% |
| 2 GiB | 4023.11 +- 2.43 | 4010.54 +- 2.55 | 0.31% |
+--------------+-----------------------+----------------------+------------+
(p/sec = partitions/sec)
```
Refs #19071.
New feature, no backport is needed.
Closesscylladb/scylladb#21153
* github.com:scylladb/scylladb:
test: Add test for compaction with corrupted SSTables
compaction: Enable integrity checks for all compaction types
sstables: Add integrity option to factories for sstable_set readers
sstables: Add integrity option to sstable::make_reader()
sstables: Add integrity option to mx::make_reader()
sstables: Load checksums and digests in mx full-scan reader
sstables: Add integrity option to data_consume_single_partition()
sstables: Disengage integrity_check from sstable class
sstables: Allow data sources to disable digest check
Metrics families (e.g., all metrics with the same name but with different labels) should have the same description.
The metric layer does not enforce that. Instead, it will use the first description provided.
It's a minor issue but the results are different than what you expect.
No need to backport.
Closesscylladb/scylladb#19947
* github.com:scylladb/scylladb:
service/storage_proxy.cc All metric groups should have the same description
raft/server.cc: All metric groups should have the same description
This depends on the previous change to the schema_builder
which makes version computation depend on definition only
instead of being new time uuid.
This way we avoid the possibility for a common mistake
when schema of a system table is extended but we forget
to bump up its version passed to .with_version().
Currently, if version is missing, we use a unique timeuuid as the
version. It's not useful for creating static schema of system tables
because to achieve the same version on all the nodes, version needs to
be provided externally.
This patch introduces a way to build the schema with version computed
from schema definition, so we can have a stable version which is the
same on all machines.
Will be used for reliable computation of schema version for system
tables. System tables currently set the version statically and we rely
on the developer to bump up the version manually when the definition
changes. We cannot use mutation hash, since system tables are
initialized too rearly (mutation hash needs system schema to be
already there). This is a very error prone process, as it is easy to
forget to do so, and the issue comes up only when testing mixed
clusters.
It will be used for hashing, which will work with raw_schema.
Also, it's more in-line with the current design, where basic information
is kept in raw_schema and other fields are derived from it.
Rename the helper function from `estimated_row_count()` to `estimated_partition_count()`
to better reflect its actual behavior. While the underlying API endpoint is
"/column_family/metrics/estimated_row_count", it actually returns the estimated
partition count of the given table.
This follows up on 26ac2c23ef which updated server-side variable names but
did not change the API endpoint name. A separate change will update the tool's
documentation to address scylladb/scylladb#21586 specifically.
Refs scylladb/scylladb#21586
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21597
Adding `Fixes` validation to a PR when backport labels were added. When the auto backport process triggers (after promotion), we will ensure each PR with backport/x.y label also has in the PR body a `Fixes` reference to an issue
Adding also this validation to `pull_github_pr.sh` per @denesb request,
Fixes: https://github.com/scylladb/scylladb/issues/20021Closesscylladb/scylladb#21563
data_consume_rows_context_m has a _column_value buffer it uses to read
key and column values into, preparing for parsing and consuming them.
This buffer is reset (released) in a few different cases:
* When using it for key - after consuming its content
* When using it for column value - when a colum has no value
However, the buffer is not released when used for a column value and the
column is consumed. This means that if a large column is read from the
sstable, this buffer can potentially linger and keep consuming memory
until either one of the other release scenarios is hit, or the reader is
destroyed.
Add a third release scenario, releasing the buffer after the row end was
consumed. This allows the buffer to be re-used between columns of the
same row, at the same time ensuring that a large buffer will not linger.
This patch can almost halve the memory consumption of reads in certain
circumstances. Point in case: the test
test_reader_concurrency_semaphore_memory_limit_engages starts to fail
after this fix, because the read doesn't trigger the OOM limit anymore
and needs doubling of the concurrency to keep passing.
This issue was found in a dtest
(`test_ics_refresh_with_big_sstable_files`), which writes some large
cells of up to 7MiB. After reading the row containing this large cell,
the reader holds on to the 7MiB buffer causing the semaphore's OOM
protection to kick in down the line.
Fixes: https://github.com/scylladb/scylladb/issues/21160Closesscylladb/scylladb#21132
The later includes the former and in addition to `seastar::format()`,
`print.hh` also provides helpers like `seastar::fprint()` and
`seastar::print()`, which are deprecated and not used by scylladb.
Previously, we include `seastar/core/print.hh` for using
`seastar::format()`. and in seastar 5b04939e, we extracted
`seastar::format()` into `seastar/core/format.hh`. this allows us
to include a much smaller header.
In this change, we just include `seastar/core/format.hh` in place of
`seastar/core/print.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21574
these unused includes are identified by clang-include-cleaner. after auditing the source files, all of the reports have been confirmed.
also, update the workflow to prevent future regressions of including unused headers in this subdirectory.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#21560
* github.com:scylladb/scylladb:
.github: add "message" to CLEANER_DIR
message: do not include unused headers
just for better readability:
* chain comparison statement when appropriate
* do not use f-string when there are no place holders
* use list comprehension when initializing a set
* remove unused import statement
* move import statement of the standard library before
those which import the 3rd-party modules
* put two empty lines in-between top-level functions.
this is recommended by PEP8.
* remove the extraneous spaces around `=` in parameter
list.
* remove the extraneous spaces in a list like `[ 1, 2, 3 ]`
so it looks like `[1, 2, 3]`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21561
This patch moves (after straightforward translation) the test
"test_views_with_future_tombstone", a regression test for #5793,
from the C++ boost framework to the Python cqlpy framework.
The main motivation this move is the ease of debugging failures:
During the work on a patch for #20679 (eliminating read-before-write)
this test began to fail, and understanding where the C++ failed was
near impossible: the Boost test framework reports that the test failed,
but not in which line or why, and adding printouts to this huge source
file require a ridiculous amount of time for recompilation every time.
In contrast, the new pytest-based version shows exactly where the
error is, beautifully:
```
> assert [] == list(cql.execute(f'select * from {mv}'))
E assert [] == [Row(b=2, a=1, c=3, d=4, e=5)]
test_materialized_view.py:1614: AssertionError
```
It shows exactly which assertion failed, and exactly what were the
values that were compared. Beautiful and super helpful for debugging.
Beyond the ease of debugging, moving this (and later, other) test to
the cql-pytest framework has additional advantages:
1. The test was misplaced, in the cql_test source file, and it belongs
with materialized views tests so let's use this opportunity to move
it to the right place.
2. Can easily run the same test on multiple versions of Scylla, and
also on Cassandra. It's a good way to confirm the test is correct.
3. No need to recompile the test after every attempt to fix the bug.
The cql_query_test.cc is huge - over 6,000 lines - and takes over
a minute to compile after every attempt to fix a bug.
Refs #16134 (the issue asks to move all MV tests to cql-pytest)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21552
schema_change_test currently fails due to failure to start a cql test
env in unit tests after the point where this is called (in one of the
test cases):
forward_jump_clocks(std::chrono::seconds(60*60*24*31));
The problem manifests with a failure to join the cluster due to
missing_column exception ("missing_column: done") being thrown from
system_keyspace::get_topology_request_state(). It's a symptom of
join request being missing in system.topology_requests. It's missing
because the row is expired.
When request is created, we insert the
mutations with intended TTL of 1 month. The actual TTL value is
computed like this:
ttl_opt topology_request_tracking_mutation_builder::ttl() const {
return std::chrono::duration_cast<std::chrono::seconds>(std::chrono::microseconds(_ts)) + std::chrono::months(1)
- std::chrono::duration_cast<std::chrono::seconds>(gc_clock::now().time_since_epoch());
}
_ts comes from the request_id, which is supposed to be a timeuuid set
from current time when request starts. It's set using
utils::UUID_gen::get_time_UUID(). It reads the system clock without
adding the clock offset, so after forward_jump_clocks(), _ts and
gc_clock::now() may be far off. In some cases the accumulated offset
is larger than 1month and the ttl becomes negative, causing the
request row to expire immediately and failing the boot sequence.
The fix is to use db_clock, which respects offsets and is consistent
with gc_clock.
The test doesn't fail in CI becuase there each test case runs in a
separate process, so there is no bootstrap attempt (by new cql test
env) after forward_jump_clocks().
Closesscylladb/scylladb#21558
Added test cases to reproduce issues with tablet migration involving views.
Refs #19149
Refs #21564
No backport needed as the PR adds only testcases.
Closesscylladb/scylladb#21566
* github.com:scylladb/scylladb:
topology_custom/test_tablets.py: add testcase for tablet migration of staged sstables
topology_custom/test_tablets.py: add testcase for tablet migration with unbuilt views
Tablet migration mixes staged and non staged sstables causing base view
inconsistencies in the pending replica. Added a testcase to reproduce
this issue.
Refs #19149.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
When a tablet gets migrated right after view was created but before the
view builder registered the new view, the pending replica will not
register the sstables in the tablet for view building causing base view
inconsistencies. This commit adds a testcase to reproduce the issue.
Refs #21564
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
We recently added a "--release <version>" option to test/cql-pytest/run
to run a cql-pytest test against a released version of Scylla, downloaded
automatically from ScyllaDB's precompiled binary repository. This patch
adds the same capability also to test/alternator/run - allowing to run
a current test/alternator test on older releases of Scylla. The
implementation in this patch reuses the same implementation from the
cql-pytest patch.
Here is an example use case: the pull request #19941 claimed that
a certain bug fix was backported to release 6.0. Was it? Let's run
the test reproducing that bug on two releases:
test/alternator/run --release 6.0 test_streams.py::test_stream_list_tables
test/alternator/run --release 6.1 test_streams.py::test_stream_list_tables
It shows that the test passes on 6.1 (so the bug is fixed there) but the
test fails 6.0. It turns out that although the fix was backported to
branch-6.0, this happened shortly after 6.0.4 was released and no later
6.0 minor release came afterwards! So the bug wasn't actually fixed
on any official release of 6.0.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21343
in order to prevent future inclusion of unused headers, let's include
"message" subdirectory to CLEANER_DIR, so that this workflow can
identify the regressions in future.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The test is only sending a subset of the running servers for the rolling
restart. The rolling restart is checking the visibility of the restarted
node agains the other nodes, but if that set is incomplete some of the
running servers might not have seen the restarted node yet.
Improved the manager client rolling restart method to consider all the
running nodes for checking the restarted node visibility.
Fixes: scylladb/scylladb#19959Closesscylladb/scylladb#21477
this change was created in the same spirit of aebb5329, which
included the fmt/iostream.h and iostream when appropriate so that
the tree can build with seastar submodule including e96932b0.
in the seastar change, we stopped including unused `fmt/ostream.h`
in a public header in seastar, so the parent projects relying on
the header to indirectly include fmt/ostream.h and iostream would
have to include these headers explicitly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21525
In cc71077e33, i have added check for the
last line in pr body looking for `closes` prefix.
It seems that this is wrong, since in a merge PR, the `closes` prefix is
not the last line
Instead, changing the search for the last line contains `closes` prefix
Closesscylladb/scylladb#21545
After merged 5a470b2bfb, we found that scylla_raid_setup fails on offline mode
installation.
This is because pkg_install() just print error and exit script on offline mode, instead of installing packages since offline mode not supposed able to connect
internet.
Seems like it occur because of missing "policycoreutils-python-utils"
package, which is the package for "semange" command.
So we need to implement the relabeling patch without using the command.
Fixes https://github.com/scylladb/scylladb/issues/21441
Also, since Amazon Linux 2 has different package name for semange, we need to
adjust package name.
Fixes https://github.com/scylladb/scylladb/issues/21351Closesscylladb/scylladb#21474
* github.com:scylladb/scylladb:
scylla_raid_setup: support installing semanage on Amazon Linux 2
scylla_raid_setup: fix failure on SELinux package installation
In the previous patch we enabled integrity checking on all compaction
types. This means that compaction jobs should now fail if they encounter
an SSTable with an invalid checksum or digest.
Add a test to verify this behavior. Test every compaction type with:
* compressed/uncompressed SSTables with invalid checksums
* compressed/uncompressed SSTables with invalid digests
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Compaction tasks create mutation readers to read SSTables from disk.
Each compaction type defines its own reader creation logic by
implementing the pure virtual function `compaction::make_sstable_reader()`.
Modify all implementations of `make_sstable_reader()` to enable
integrity checking on the created readers. This way, all compaction
tasks will be able to detect corruption issues on the compacting
SSTables.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Expose the integrity option of the sstable reader factories to the
corresponding sstable_set factories, namely:
* `sstable_set::make_local_shard_sstable_reader()`
* `sstable_set::make_full_scan_reader()`
* `sstable_set::make_range_sstable_reader()`
This is needed to support integrity checking in compaction.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Expose the integrity option of the mx reader via the public factory
method `sstable::make_reader()`. Same flag is offered for full-scan
readers via `sstable::make_full_scan_reader()`.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In previous patch we added support for integrity checking in the mx
full-scan reader.
Do the same for the mx reader, which is the one used by all compaction
types except for scrub compaction. The mx reader should now support
integrity checking for single-partition and multi-partition reads.
Single-partition reversed reads were excluded from this patch because
they are not used in compaction.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In 716fc487fd we introduced integrity checking in the mx crawling reader
(later renamed to full-scan reader in 6250ff18eb).
When integrity checking is enabled, the full-scan reader expects that
the checksum and digest components have been loaded from disk by the
caller. This is true for the validation path, in which
`sstable::validate()` loads the components before creating the full-scan
reader, but it doesn't hold if a full-scan reader is created directly by
a higher-level function through `sstable::make_full_scan_reader()`.
As part of the effort to enable integrity checking for compaction, this
becomes a blocker for scrub compaction, which relies solely on full-scan
readers.
Solve this by allowing the mx full-scan reader to load the checksum and
digest components internally. The loading is an asynchronous operation,
so it has to be deferred until the first buffer fill.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The `integrity_check` flag was first introduced as a parameter in
`sstable::data_stream()` to support creating input streams with
integrity checking. As such, it was defined in the sstable class.
However, we also use this flag in the kl/mx full-scan readers, and, in
a later patch, we will use it in `class sstable_set` as well.
Move the definition into `types_fwd.hh` since it is no longer bound to
the sstable class.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The compressed and checksummed data sources offer digest checking as an
optional feature. It can be enabled via the boolean template parameter
`check_digest`. If enabled, the data sources calculate the actual digest
chunk-by-chunk whenever `get()` is called, and compare with the expected
digest when all data have been read. If the actual digest cannot be
calculated due to a partial read or skip, the data sources treat this
condition as an internal error.
Relax this constraint by allowing the data sources to handle digest
checks as best effort, i.e., continue to operate with digest checking
disabled if the actual digest cannot be calculated.
We will use this in later patches to enable digest checking for
compaction. Compaction can cause both partial reads and skips (e.g., in
case of cleanup compaction) and we cannot predict skips beforehand.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The req_param class is used to help parsing http request parameters from
strings into exact types (typically some simple types like strings,
integrals or boolean). On it there are three fields:
- name -- the parameter name
- param -- the parameter string value
- value -- the parameter value of desired type
The `param` thing is not really needed, it's only used by few places
that print it into logs, but they may as well just print the `value`
thing itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21502
After calling api::parse_tables() the resulting vector of table names
cannot be empty, because in case parameter is missing, the parse_tables
function returns all tables from keyspace anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21501
Split compaction divides the partitions in an existing sstable into two groups and writes them into two new sstables, which replace the original one. The partition count from the original sstable is used as an estimate when writing the new ones, but this estimate is not accurate as the partitions are split between the two new sstables and each will contain only a portion of the original partition count. This also causes the bloom filters to be rebuilt at the end of compaction, as they were initially built with inaccurate estimates.
Fix this by using a better estimate for the output sstables, which is half the original partition count.
Fixes#20253
Improvement; No need to backport.
Closesscylladb/scylladb#20908
* github.com:scylladb/scylladb:
compaction: use better partition estimate for split compaction
compaction::table_state: implement `get_token_range_after_split()` wrapper
replica/table: implement `get_token_range_after_split()` wrappers
tablet_map: introduce `get_token_range_after_split()`
tablet_map: implement existing get_token_range() using the new variant
tablet_map: introduce `get_token_range()` variant
tablet_map: introduce `get_last_token()` variant
Add test to ensure backup tasks properly handle non-existent snapshots
by:
- Verifying backup task reports failure status
- Ensuring error is propagated through task status API
Previously untested edge case when backing up a snapshot that doesn't
exist in the test_backup.py tests.
Refs scylladb/scylladb#21381
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21385
When a backport PR is promoted to the release branch, we automatically close the backport PR (since GitHub will only close the one based on the default branch) and update the labels in the original PRs
In a situation when we have multiple `closes` prefixes, the script will use the first one (which is not the correct one), see 3ddb61c90e
Fixing this by always using the last line with the `closes` prefix
Closesscylladb/scylladb#21498
After merged 5a470b2, we found that scylla_raid_setup fails on offline mode
installation.
This is because pkg_install() just print error and exit script on offline mode, instead of installing packages since offline mode not supposed able to connect
internet.
Seems like it occur because of missing "policycoreutils-python-utils"
package, which is the package for "semange" command.
So we need to implement the relabeling patch without using the command.
Fixes#21441
Split compaction divides the partitions in an existing sstable into two
groups and writes them into two new sstables, which replace the original
one. The partition count from the original sstable is used as an
estimate when writing the new ones, but this estimate is not accurate as
the partitions are split between the two new sstables and each will
contain only a portion of the original partition count. This also causes
the bloom filters to be rebuilt at the end of compaction, as they were
initially built with inaccurate estimates.
Fix this by using a better estimate for the output sstables based on the
token ranges written to them.
Fixes scylladb#20253
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Expose the functionality of `tablet_map::get_token_range_after_split()`
via the replica::table class.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added `get_token_range_after_split()`, which returns the token range the
given token will belong to after a tablet split. This is required to
estimate the token ranges of resultant sstables after a split.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Implement `get_token_range()` to return the token range of the specified
tablet with the given `log2_tablets` size. This will be used to deduce
which range a token will end up in if the tablet is split.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Implement `get_last_token()`, which returns the largest token owned
by the specified tablet with the given `log2_tablets` size. This will be
used to deduce token ranges for a tablet with any arbitrary
`tablet_count`.
Also, update the existing public `get_last_token()` to utilize the new
variant.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
1. Add `retry_strategy` interface and default implementation for exponential back-off retry strategy.
2. Add new S3 related errors, also introduce additional errors to describe pure http errors that has no additional information in the body.
3. Add retries to the s3 client, all retries are coordinated by an instance of `retry_strategy`. In a case of error also parse response body in attempt to retrieve additional and more focused error information as suggested by AWS. See https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html. Introduce `aws_exception` to carry the original `aws_error`.
4. Discard whatever exception is thrown in `abort_upload` when aborting multipart upload since we don't care about cleanly aborting it since there are other means to clean up dangling parts, for example `rclone cleanup` or S3 bucket's Lifecycle Management Policy.
5. Add tests to cover retries, and retry exhaustion. Also add tests for jumbo upload.
6. Add the S3 proxy which is used to randomly inject retryable S3 errors to test the "retry" part of the S3 client. Switch the `s3_test` to use the S3 proxy. `s3_tests` set afloat `put_object` problem that was causing segmentation when retrying, fixed.
7. Extend the `s3_test` to use both `minio` and `proxy` configurations.
8. Add parameter to the proxy to seed the error injection randomization to make it replayable.
fixes: #20611fixes: #20613Closesscylladb/scylladb#21054
* github.com:scylladb/scylladb:
aws_errors: Make error messages more verbose.
test: Make the minio proxy randomization re-playable
test/boost/s3_test: add error injection scenarios to existing test suite
test: Switch `s3_test` to use proxy
test: Add more tests
client: Stop returning error on `DELETE` in multipart upload abortion
client: Fix sigsegv when retrying
client: Add retries
client: Adjust `map_s3_client_exception` to return exception instance
aws_errors: Change aws_error::parse to return std::optional<>
aws_errors: Add http errors mapping into aws_error
client: Add aws_exception mapping
aws_error: Add `aws_exeption` to carry original `aws_error`
aws_errors: Add new error codes
client: Introduce retry strategy
Add a buffer hint to the multishard reader. This is an internal hint, used by the multishard reader to provide a hint to the shard reader, on how much data exactly is needed by the multishard reader from the respective shard. This hint allows eliminating extraneous cross-shard round-trips and possible shard reader evict-recreate cycles. Building on this, repair sets its own row buffer size as the max buffer size on the multishard reader, ensuring that the row buffer is filled with the minimum amount of cross-shard round trips and minimal reader recreation.
To further eliminate unnecessary evictions, this PR also disables the multishard reader's read-ahead which is a mechanism that was designed to reduce latency for user-reads but it can be too aggressive for repair, causing unnecessary extra congestion on the already struggling streaming semaphores.
Refs: https://github.com/scylladb/scylladb/issues/18269
Fixes: https://github.com/scylladb/scylladb/issues/21113
The performance impact was measured with an SCT test, which creates a cluster of 3 nodes with 16 shards, then adds a 4th one with 12 shards.
Currently, it is the bootstrap time which is the worse in the case of mixed shard clusters, see below for the improvement measured during bootstrap:
| | master | buffer-hint | metric |
| ------------ | ------------- | ------------- | --------------------------------------------------- |
| evictions | 0.9M | 93.0K | scylla_database_paused_reads_permit_based_evictions |
| read (bytes) | 9.0T | 3.9T | scylla_reactor_aio_bytes_read |
| read (ops) | 88.0M | 33.5M | scylla_reactor_aio_reads |
| time | 56min | 20min | N/A |
This is a performance improvement, no backport required.
Closesscylladb/scylladb#20815
* github.com:scylladb/scylladb:
test/boost/mutation_reader_test: add test for multishard reader buffer hint
repair/row_level: disable read-ahead
db/config: introduce repair_multishard_reader_enable_read_ahead
readers/multishard: implement the read_ahead flag
replica/database: make_multishard_streaming_reader(): expose the read_ahead parameter
readers/multishard: add read_ahead parameter
repair/row_level: set max buffer size on multishard reader
replica/database: make_multishard_streaming_reader(): expose buffer_hint parameter
db/config: introduce enable_repair_multishard_reader_buffer_hint
readers/multishard: multishard_reader: pass hint to shard_reader
readers/multishard: shard_reader_v2::fill_reader_buffer(): respect the hint
readers/multishard: propagate fill_buffer_hint to shard_reader:fill_reader_buffer()
readers/multishard: shard_reader: extract buffer-fill into its own method
before this change, we specify the KillMode of the scylla-service
service unit explicitly to "process". according to
according to
https://www.freedesktop.org/software/systemd/man/latest/systemd.kill.html,
> If set to process, only the main process itself is killed (not recommended!).
and the document suggests use "control-group" over "process".
but scylla server is not a multi-process server, it is a multi-threaded
server. so it should not make any difference even if we switch to
the recommended "control-group".
in the light that we've been seeing "defunct" scylla process after
stopping the scylla service using systemd. we are wondering if we should
try to change the `KillMode` to "control-group", which is the default
value of this setting.
in this change, we just drop the setting so that the systemd stops the
service by stopping all processes in the control group of this unit
are stopped.
Refs scylladb/scylladb#21507
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21508
before this change, the header files generated with `idl-compiler.py`
are not regenerated if `idl-compiler.py` is updated. but they should,
as the change to the script could in turn change the generated header
files. because we have a typo in the `DEPENDS` argument,
`${idle_compiler}` is expanded to an empty string.
in this change, the typo is corrected, and the dependency from the
generated headers to the script is correctly reflected in the building
rules.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21475
In scylladb/scylladb#19745, view_builder was migrated to group0 and since then it is dependant on group0_service.
Because of this, group0_service should be initialized/destroyed before/after view_builder.
This patch also adds error injection to `raft_server_with_timeouts::read_barrier`, which does 1s sleep before doing the read barrier. There is a new test which reproduces the use after free bug using the error injection.
Fixesscylladb/scylladb#20772scylladb/scylladb#19745 is present in 6.2, so this fix should be backported to it.
Closesscylladb/scylladb#21471
* github.com:scylladb/scylladb:
test/boost/secondary_index_test: add test for use after free
api/raft: use `get_server_with_timeouts().read_barrier()` in coroutines
main,cql_test_env: start group0_service before view_builder
in seastar e96932b05f394b27cd0101e24f0584736795b50f, we stopped
including unused `fmt/ostream.h`. this helped to reduce the header
dependency. but this also broke the build of scylladb, as we rely
on the `fmt/ostream.h` indirectly included by seastar's header project.
in this change, we include `fmt/iostream.h` and `iostream` explictly
when we are using the declarations in them. this enables us to
- bump up the seastar submodule
- potentially reduce the header dependency as we will be able to
include seastar/core/format.hh instead of a more bloated
seastar/core/print.hh after bumping up seastar submodule
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21494
Reproduces scylladb/scylladb#20772.
Add error injection to `raft_server_with_timeouts::read_barrier`,
which does 1s sleep before doing the read barrier.
It is unsafe to do `get_server_with_timeouts().read_barrier()` in
continuations because `get_server_with_timeouts()` returns
raft server by value and it may be deallocated when `read_barrier()` yields,
causing use-after-return.
Simple workaround is to use the read barrier in coroutine and co_await
it. Then the raft server is kept on stack until the read barrier is
finished.
I've checked all codebase and it looks like the only place where
`group0_with_timeouts().read_barrier()` is in continuation, is
api/raft.cc.
Co-authored-by: Piotr Dulikowski <piodul@scylladb.com>
Separate the configuration for enabling the tablets feature from the enablement of tablets when creating new keyspaces.
This change always enables the TABLETS cluster feature and the tablets logic respectively.
The `enable_tablets` config option just controls whether tablets are enabled or disabled by default for new keyspaces.
If `enable_tablets` is set to `true`, tablets can be disabled using `CREATE KEYSPACE WITH tablets = { 'enabled': false }` as it is today.
If `enable_tablets` is set to `false`, tablets can be enabled using `CREATE KEYSPACE WITH tablets = { 'enabled': true }`.
The motivation for this change is to simplify the user experience of using tablets by setting the default for new keyspaces to false amd allowing the user to simply opt-in by using tablets = {enabled: true }.
This is not pissible today.
The user has to enable tablets by default for all new keyspaces (that use the NetworkTopologyStrategy) and then actively opt-out to use vnodes.
* Not required to be backported to OSS versions. May be backported to specific enterprise versions
* This PR resubmits https://github.com/scylladb/scylladb/pull/20729 that was reverted in 73b1f66b70 due to https://github.com/scylladb/scylladb/issues/21159 which is now fixed
Closesscylladb/scylladb#21451
* github.com:scylladb/scylladb:
data_dictionary: keyspace_metadata::describe: print tablets enabled also when defaulted
tablets_test: test enable/disable tablets when creating a new keyspace
treewide: always allow tablets keyspaces
feature_service: prevent enabling both tablets and gossip topology changes
alternator: create_keyspace_metadata: enable tablets using feature_service
For performance reasons, mutation_partition_v2::maybe_drop(), and by extension
also mutation_partition_v2::apply_monotonically(mutation_partition_v2&&)
can evict empty row entries, and hence change the continuity of the merged
entry.
For checking that apply_to_incomplete respects continuity,
test_apply_to_incomplete_respects_continuity obtains the continuity of
the partition entry before and after apply_to_incomplete by calling
e.squashed().get_continuity(). But squashed() uses apply_monotonically(),
so in some circumstances the result of squashed() can have smaller
continuity than the argument of squashed(), which messes with the thing
that the test is trying to check, and causes spurious failures.
This patch changes the method of calculating the continuity set,
so that it matches the entry exactly, fixing the test failures.
Fixesscylladb/scylladb#13757Closesscylladb/scylladb#21459
Provide a seed to the proxy randomization, the idea that the `test.py` will initialize the seed from `/dev/urandom` and print the seed when starting, in case some tests failed the dev is supposed to re-play it locally with the same seed (if it didnt repro otherwise) using the `start_s3_proxy.py` and providing it with the aforementioned seed using `--rnd-seed` command line argument
Add variants of existing S3 tests that route through a proxy instead of connecting directly to MinIO. The proxy allows injecting errors to validate error handling and recovery mechanisms under failure conditions.
Switch `s3_test` to use the S3 proxy which is used to randomly inject retryable S3 errors to test the "retry" part of the S3 client.
Fix `put_object` to make it retryable
Discard whatever exception is thrown in `abort_upload` when aborting multipart upload since we don't care about cleanly aborting it since there are other means to clean up dangling parts, for example `rclone cleanup` or S3 bucket's Lifecycle Management Policy
Add retries to the s3 client, all retries are coordinated by an instance of `retry_strategy`. In a case of error also parse response body in attempt to retrieve additional and more focused error information as suggested by AWS. See https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html.
Also move the expected http status check to the `make_s3_error_handler` since the http::client::make_request call is done with `nullopt` - we want to manage all the aws errors handling in s3 client to prevent the http client to validate it and fail before we have a chance to analyze the error properly
"Unfuturize" the `map_s3_client_exception` since the retryable client is going to be implemented using coroutines and no `future` is needed here, just to save unnecessary `co_await` on it
In scylladb/scylladb#19745, view_builder was migrated to group0
and since then it is dependent on group0_service.
Because of this, group0_service should be initialized/destroyed
before/after view_builder.
Fixesscylladb/scylladb#20772
Co-authored-by: Dawid Mędrek <dawid.medrek@scylladb.com>
Python and Python developers don't like directory names to include a minus sign, like "cql-pytest". In this patch we rename test/cql-pytest to test/cqlpy, and also change a few references in other code (e.g., code that used test/cql-pytest/run.py) and also references to this test suite in documentation and comments.
Arguably, the word "test" was always redundant in test/cql-pytest, and I want to leave the "py" in test/cqlpy to emphasize that it's Python-based tests, contrasting with test/cql which are CQL-request-only approval tests.
The second patch in the series fixes a small regression in the test/cqlpy/run script.
Fixes#20846
Test organization only, so backports not strictly necessary, but let's do them anyway because otherwise it will make any future backporting of tests in the cqlpy directory more messy than it needs to be.
Closesscylladb/scylladb#21446
* github.com:scylladb/scylladb:
test/cqlpy: fix "run" script without any parameters
test: rename "cql-pytest" to "cqlpy"
Test both configuration values for `enable_tablets`
and the possibility to explicitly enable or disable
tablets, respectively, when creating a keyspace using the
`tablets = {'enabled': true|false}` CREATE KEYSPACE option.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With the tablets feature always enabled (Unless gossip toopology
changes are forced), the enable_tablets option now controls only
the default for newly created keyspaces.
Even when set to `false`, tablets are still enabled as a
feature and the user may explicitly enable tablets
using `CREATE KEYSPACE <name> WITH tablets = {'enabled': true}`
Note: best viewed with `git show -w`
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Tablets require raft consistent topology changes.
Therefore, document that they are incompatible in
the config help and prevent their usage in
`feature_config_from_db_config`
Fixesscylladb/scylladb#21075
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The multishard reader's read-ahead was designed to reduce the latency of
range scans. But in the case of repair, read-ahead is suspected to
contribute significant extra load on the congested streaming semaphore
and thus contribute to the subsequent trashing (excessive reader
eviction).
First off, read-ahead was designed with pages of limited size in mind.
Repair can read much more, even for a single repair buffer. This can
lead to read-ahead concurrency to continue ramping up, creating and
using more and more readers.
Secondly, repair is not latency sensitive, so even when working well and
there is no congestion, the benefits are negligible.
The use of read-ahead is now controllable by the new
repair_multishard_reader_enable_read_ahead config item, defaulting to
false.
Continuing the previous patch, expose the just added read_ahead
parameter of make_multishard_combining>_reader_v2().
Set to read_ahead::yes by all callers, keeping the current default.
The multishard reader is used in the mixed-shard case, when a repair has
to read from all other shards. It is very important that cross-shard
roundtrips and possible evict-recreate cycles for the shard readers is
avoided. For this end, make use of the recently introduced internal
buffer hint feature in the multishard reader and set it's buffer size to
match that of the row level repair buffer size.
The use of the buffer-hint can be controlled with the recently
introduced repair_multishard_reader_buffer_hint_size config param.
Expose the buffer hint functionality added by the previous commits, to
callers of make_multishard_streaming_reader(). All callers disable it
currently, it will be used in the next patch.
`commit.get_pulls()` in PyGithub returns pull requests that are directly associated with the given commit
Since in closed PR. the relevant commit is an event type, the backport
automation didn't get the PR info for backporting
Ref: https://github.com/scylladb/scylladb/issues/18973Closesscylladb/scylladb#21468
Sample the clock once to avoid the filter returning different results.
Range algorithms may use multiple passes, so it's better to return
consistent results.
Closesscylladb/scylladb#21400
we don't use `std::list` in compaction/compaction_manager.hh, neither
is this header responsible for exposing the declarations in `<list>`.
so let's stop `#include` this header.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21436
Fixes#20716
Adds optional abort_source to all s3 client operations. If provided, will propagate to actual HTTP client and allow for aborting actual net op.
Note: this uses an abort source per call, not a client-local one.
This is for two reasons:
1.) The usage pattern of the client object is to create it outside the eventual owning object (task) that hosts the relevant abort source
2.) It is quite possible to want to have different/no abort source for some operation usage.
Also adds forward usage of task abort_source in backup tasks upload s3 call, making it more readily abort-able.
Closesscylladb/scylladb#21431
* github.com:scylladb/scylladb:
backup_task: Use task abort source in s3 client call
s3::client: Make operations (individually) abortable
In order not to forget to resolve conflicts in backport PRs, we should add some reminders to the PR author so it will not be forgotten
the new action will run twice a week and will send a reminder only for
PR opened with conflicts for 3 days or more
Fixes: https://github.com/scylladb/scylladb/issues/21448Closesscylladb/scylladb#21449
Fixes#20716
Propagates abort source in task object to actual network call,
thus making the upload workload more quickly abortable.
v2: Fix test to handle two versions after each other
A recent improvement to test/cqlpy/run to add the "--release" option
broke the ability to run this script it without *any* options (no test
name, etc.). This patch fixes this case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Python and Python developers don't like directory names to include a
minus sign, like "cql-pytest". In this patch we rename test/cql-pytest
to test/cqlpy, and also change a few references in other code (e.g., code
that used test/cql-pytest/run.py) and also references to this test suite
in documentation and comments.
Arguably, the word "test" was always redundant in test/cql-pytest, and
I want to leave the "py" in test/cqlpy to emphasize that it's Python-based
tests, contrasting with test/cql which are CQL-request-only approval
tests.
Fixes#20846
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
to_string.hh does not use this header, neither is it obliged to
expose the content of this header. so, let's remove this include.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21440
Calculate a buffer fill hint and pass it to
shard_reader_v2::fill_buffer(), so the underlying buffer-fill can be
optimized to avoid multiple cross shard round-trips, as well as possible
evict-recreate cycles.
The buffer hint mechanism is opt-in, enabled via the new
multishard_reader_buffer_hint parameter.
When the hint is provided, respect it: make sure the returned buffer is
of the requested size, stopping early if the stop_token is seen.
To reduce the amount of possible eviction-recreate cycles while the
buffer is filled, disable auto-pause for the duration of the
fill_reader_buffer() call. For this purpose, auto_pause_disable_guard is
added to evictable_reader_v2.
The hint will tell the shard reader exactly how much data to produce, to
avoid multiple cross-shard round-trips and possible evict-recreate
cycles.
The hint is neither used yet or calculated yet, this is coming in the
next patches.
To reduce the dependency load, replace boost ranges with std::ranges.
Cleanup; no backport.
Closesscylladb/scylladb#21450
* github.com:scylladb/scylladb:
bytes_ostream: replace boost ranges with std ranges
bytes_ostream: extract fragment_iterator into namespace scope
Since Scylla is a public repo, when we create a fork, it doesn't fork the team and permissions (unlike private repos where it does).
When we have a backport PR with conflicts, the developers need to be able to update the branch to fix the conflicts. To do so, we modified the logic of the backport automation as follows:
- Every backport PR (with and without conflicts) will be open directly on the `scylladbbot` fork repo
- When there are conflicts, an email will be sent to the original PR author with an invitation to become a contributor in the `scylladbbot` fork with `push` permissions. This will happen only once if Auther is not a contributor.
- Together with sending the invite, all backport labels will be removed and a comment will be added to the original PR with instructions
- The PR author must add the backport labels after the invitation is accepted
Fixes: https://github.com/scylladb/scylladb/issues/18973Closesscylladb/scylladb#21401
Updates the theme to the latest version to enable tooltips and modifies the db_options.tmpl to show the new role in action.
Closesscylladb/scylladb#21324
Add new dependency pytest-xdist to the toolchain. This will allow executing
boost and unit tests from pytest in parallel, reducing the time
needed for the run.
Closesscylladb/scylladb#21222
test_tablets: add rack decommission test cases
Test scenarios where decommissioing a compelte rack
should succeed, and reproduce scylladb/scylladb#19475
where decommissioning a rack would fail since the
number of remaining racks is insufficient to satisfy
the replication factor, even though the number of nodes
is sufficient, enshrining this behavior.
Refs scylladb/scylladb#19475
* This PR adds unit tests and improves an error message. No backport required.
Closesscylladb/scylladb#20747
* github.com:scylladb/scylladb:
tablet_allocator: improve error message when unable to find replicas when draining
test_tablets: add rack decommission test cases
topology_experimental_raft/test_tablets: get_tablet_count_per_shard_for_host: move shards_count param to be last
test/pylib: ServerInfo: add datacenter and rack attributes
test: everywhere: drop unused imports of ServerInfo
Refs #20716
Adds optional abort_source to all s3 client operations. If provided, will
propagate to actual HTTP client and allow for aborting actual net op.
Note: this uses an abort source per call, not a client-local one.
This is for two reasons:
1.) The usage pattern of the client object is to create it outside the
eventual owning object (task) that hosts the relevant abort source
2.) It is quite possible to want to have different/no abort source for
some operation usage.
* 'gleb/gossip-cleanup-v3' of github.com:scylladb/scylla-dev:
gossiper: start failure_detector_loop on shard 0 only
gossiper: use 1 seconds instead of 1000 milliseconds
gossiper: remove unused code
gossiper: co-routinize do_send_ack2_msg
gossiper: do not needlessly call get_endpoint_state_ptr in handle_major_state_change
gossiper: fix weird logic in get_live_members
gossiper: drop unneeded this->
gossiper: fold get_or_create_endpoint_state into my_endpoint_state
gossiper: co-routinize do_send_ack_msg
C++ concept evaluation rules clash with nested class definition rules
with the result that evaluating concepts about the nested class within
the enclosing class doesn't work. Extract bytes_ostream::fragment_iterator
to avoid that.
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::all_of` and `std::ranges::any_of`
in this change, we replace `boost::algorithm::all_of` and `boost::algorithm::any_of` with
`std::ranges::all_of` and `std::ranges::any_of` respectively.
to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#21411
* github.com:scylladb/scylladb:
treewide: s/boost::algorithm::any_of/std::ranges::any_of/
treewide: s/boost::algorithm::all_of/std::ranges::all_of/
In the past, DESC SCHEMA would produce create statements for both the base
and the log table. That was incorrect as the log table is automatically
created alongside the base one. That was solved in scylladb/scylladb@9ab57b1
(scylladb/scylladb#18467).
The mentioned changes implemented the following solution:
* DESC SCHEMA/KEYSPACE/TABLE would still print a create statement for the
CDC base table,
* DESC SCHEMA/KEYSPACE would start printing an alter statement for the
CDC log table. That statement would ensure that the restored log table
has the same parameters as the original one,
* DESC TABLE <base table> would behave as DESC SCHEMA/KEYSPACE, i.e.
it would print a create statement for the base table and an alter
statement for the log table,
* DESC TABLE <log table> would result in an error.
While that solution was good and behaved correctly in the context of
restoring the schema, it had one flaw: describe statement aren't only
used as a means for producing a backup; they also serve an informative
purpose to learn about the schema, e.g. to learn what parameters a specific
table uses. Because we didn't allow for describing CDC log tables, the user
couldn't look them up directly via a describe statement -- they had to
describe the base table for that.
Attempting to describe a log table ended with an error, e.g.:
```
$ DESC TABLE ks.t_scylla_cdc_log;
ks.t_scylla_cdc_log is a cdc log table and it cannot be described directly. Try `DESC TABLE ks.t` to describe cdc base table and it's log table.
```
In these changes, we allow for describing CDC log tables again. The
semantics of the first three bullets above remains unchanged, but
we impose new behavior for DESC TABLE <log table>:
* When the user executes DESC TABLE <log table>, a create statement
will be returned, treating the table as if it were a regular one,
* The create statement will be wrapped in CQL comment markers.
The rationale for the second bullet is that although we want to give the
user a means to look into the structure and options of a CDC log table,
the returned statement is not supposed to be ever executed by them. We
want to minimize the risk of that.
An example of the behavior after the change:
```
$ DESC TABLE ks.t_scylla_cdc_log;
/* Do NOT execute this statement! It's only for informational purposes.
A CDC log table is created automatically when the base is created.
CREATE TABLE ks.t_scylla_cdc_log (
"cdc$stream_id" blob,
"cdc$time" timeuuid,
"cdc$batch_seq_no" int,
"cdc$end_of_batch" boolean,
"cdc$operation" tinyint,
"cdc$ttl" bigint,
p int,
PRIMARY KEY ("cdc$stream_id", "cdc$time", "cdc$batch_seq_no")
) WITH CLUSTERING ORDER BY ("cdc$time" ASC, "cdc$batch_seq_no" ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'enabled': 'false', 'keys': 'NONE', 'rows_per_partition': 'NONE'}
AND comment = 'CDC log for ks.t'
AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '60', 'compaction_window_unit': 'MINUTES', 'expired_sstable_check_frequency_seconds': '1800'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND speculative_retry = '99.0PERCENTILE';
*/
```
We also extend the developer documentation regarding DESCRIBE
statements on CDC tables.
Fixesscylladb/scylladb#21235
Backport: these changes are an enhancement, so not needed.
Closesscylladb/scylladb#21228
* github.com:scylladb/scylladb:
docs/dev: Document semantics of describing CDC tables
cql3: Allow for describing CDC log tables
The overload was introduced by a8b14b0227 (utils: add timeout error
injection with lambda), but is only used by the test nowadays.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21377
database.hh has large fan-in and therefore can trigger a lot
of recompilations if included. Replace with smaller dependencies.
Closesscylladb/scylladb#21424
in this series:
- remove unused `#include` in "lang" subdirectory
- add index and lang to CLEANER_DIR
---
cleanup and improvements in the CI, hence no need to backport.
Closesscylladb/scylladb#21437
* github.com:scylladb/scylladb:
.github: add index and lang to CLEANER_DIR
lang: remove unused "#includes"
in addition to the inheritance support, `isinstance()` is also
the recommended way to check for types by PEP8.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21438
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::any_of`.
in this change, we replace `boost::algorithm::any_of` with
`std::ranges::any_of`
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::all_of`.
in this change, we replace `boost::algorithm::all_of` with
`std::ranges::all_of`
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Recently, seastar rpc started accepting std::type_identity in addition
to boost::type as a type marker (while labeling the latter with an
ominous deprecation warning). Reduce our depedendency on boost
by switching to std::type_identity.
The code adds a node to a set and then removes it if a condition is met.
Add to the set if the condition is not met instead. Note that the
original set never has local endpoint (it is only added locally), so the
code is equivalent.
Test scenarios where decommissioing a compelte rack
should succeed, and reproduce scylladb/scylladb#19475
where decommissioning a rack would fail since the
number of remaining racks is insufficient to satisfy
the replication factor, even though the number of nodes
is sufficient, enshrining this behavior.
Refs scylladb/scylladb#19475
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Terminology note: in the context of this series, "index page" means an contiguous segment of the index file starting (inclusive) at a key corresponding to a summary entry and ending (exclusive) before the key corresponding to the next summary entry. "Index pages" are not related to filesystem pages.
---
In a single-partition read, if the searched partition key is the first key in its index page, we start scanning the index for that key starting at the previous index page (inclusive), even though we could start directly from the key's page. Similarly, if the searched partition key is absent from the sstable and lies after all other keys in its appropriate page, we additionally scan the next page, even though it's known from the summary that it can't possibly contain the key.
Those cases are wasteful. It's worse than it might seem at first glance. When partitions are small, only a small fraction of search keys fulfills those conditions (i.e. "first key in its page" or "an absent key greater than the last key in its page"), so the waste doesn't matter much. But when partitions are big enough, every index page contains only one partition key (and a promoted index for that partition), which directly means that *all* search keys fulfill the conditions, which means that total index reading work is two times bigger than what it should be.
In addition, there is a secondary performance bug which, when the aforementioned conditions are fulfilled, causes *additional* I/O to happen *past* the index reads which are actually parsed and used. In effect, the index I/O in single-partition reads might be not just doubled, but even tripled (that's for IOPS — throughput might be multiplied even more), all because of a slight inaccuracy in the edge cases.
This series fixes those inefficiencies by tightening the edge cases and ensuring that single-partition reads always read only a single index page.
Here's an example where we query the first row (i.e. `LIMIT 1`) of a certain partition key, in a table with large (1 MB) promoted indexes. Before the patch, the lookup of the lower bound involves 3 serialized disk reads (as described above) to subsequent index pages, and even the lookup of the upper bound involves 2 disk reads:
```
Execute CQL3 query
Parsing a statement [shard 0]
Processing a statement for authenticated user: anonymous [shard 0]
Executing read query (reversed false) [shard 0]
Creating read executor for token -1297921881139976049 with all: [127.11.11.1] targets: [127.11.11.1] repair decision: NONE [shard 0]
Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0]
read_data: querying locally [shard 0]
Start querying singular range {{-1297921881139976049, pk{00023130}}} [shard 0]
[reader concurrency semaphore user] admitted immediately [shard 0]
[reader concurrency semaphore user] executing read [shard 0]
Reading key {-1297921881139976049, pk{00023130}} from sstable ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 38359040 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 38391808 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 38359040, successfully read 32768 bytes [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 38391808, successfully read 32768 bytes [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39370752 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39403520 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39370752, successfully read 32768 bytes [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39403520, successfully read 32768 bytes [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
upper_bound_cache_only({position: clustered, ckp{}, 1}): no upper bound [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 41390080 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 41422848 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 41390080, successfully read 32768 bytes [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 41422848, successfully read 32768 bytes [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: scheduling bulk DMA read of size 21926 at offset 819200 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: finished bulk DMA read of size 21926 at offset 819200, successfully read 24576 bytes [shard 0]
Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 0 cell(s) (0 live, 0 dead) [shard 0]
Querying is done [shard 0]
Done processing - preparing a result [shard 0]
Request complete
```
After the patch, the lookup of each bound involves 1 read:
```
Execute CQL3 query
Parsing a statement [shard 0]
Processing a statement for authenticated user: anonymous [shard 0]
Executing read query (reversed false) [shard 0]
Creating read executor for token -1297921881139976049 with all: [127.11.11.1] targets: [127.11.11.1] repair decision: NONE [shard 0]
Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0]
read_data: querying locally [shard 0]
Start querying singular range {{-1297921881139976049, pk{00023130}}} [shard 0]
[reader concurrency semaphore user] admitted immediately [shard 0]
[reader concurrency semaphore user] executing read [shard 0]
Reading key {-1297921881139976049, pk{00023130}} from sstable ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39370752 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39403520 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39370752, successfully read 32768 bytes [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39403520, successfully read 32768 bytes [shard 0]
upper_bound_cache_only({position: clustered, ckp{}, 1}): no upper bound [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: scheduling bulk DMA read of size 21926 at offset 819200 [shard 0]
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: finished bulk DMA read of size 21926 at offset 819200, successfully read 24576 bytes [shard 0]
Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 0 cell(s) (0 live, 0 dead) [shard 0]
Querying is done [shard 0]
Done processing - preparing a result [shard 0]
Request complete
```
Doesn't have to be backported, since the problem only affects performance, not correctness, and it has been present since forever.
Closesscylladb/scylladb#20897
* github.com:scylladb/scylladb:
index_reader: remove a piece of misguided code involved in single-partition reads
index_reader: in single-partition reads, don't read more than one page
index_reader: fix unnecessary reads of preceding index pages
Prepare for the next comit that will add a version
accepting a list of servers: `get_tablet_count_per_shard_for_hosts`
for which we want `shards_per_node` to be last and have a default value.
Also, fix the type hint for `full_tables`, as it had a syntax error,
using `:` instead of `,`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them by continue with shutdown.
stop_ongoing_compactions, in particular, currently returns the status
of stopped compaction tasks from `stop_tasks`, but still all tasks
must be stopped after it, even if they failed, so assert that
and ignore the errors.
Fixesscylladb/scylladb#21159
* Needs backport to 6.2 and 6.1, as commit 8cc99973eb causes handles storage that might cause compaction tasks to fail and eventually terminate on shudown when the exceptions are thrown in noexcept context in the deferred stop destructor body
Closesscylladb/scylladb#21299
* github.com:scylladb/scylladb:
compaction_manager: stop: await _stop_future if engaged
compaction_manager: really_do_stop: assert that no tasks are left behind
compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
compaction/compaction_manager: stop_tasks(): unlink stopped tasks
compaction/compaction_manager: make _tasks an intrusive list
It's difficult to use nested classes with C++ concepts, since the class
might not be fully defined at the point the concept is evaluated,
resulting in spurious errors (e.g. thinking tokens_iterator is not
default constructible). Move it to namespace scope to reduce pain.
There's a whole lot of helpers and wrappers in api/ that help handlers manipulate keyspaces and tables. One of those is foreach_column_family which calls the provided callable on a table on each shard. There's exactly the same (but a bit more flexible) helper nearby. While at it, this helper gets a better name.
Closesscylladb/scylladb#21398
* github.com:scylladb/scylladb:
api: Rename set_tables -> for_tables_on_all_shards
api: Remove foreach_column_family() helper
The current condition that consults the compaction manager
state for awaiting `_stop_future` works since _stop_future
is assigned after the state is set to `stopped`, but it is
incidental. What matters is that `_stop_future` is engaged.
While at it, exchange _stop_future with a ready future
so that stop() can be safely called multiple times.
And dropped the superfluous co_return.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
stop_ongoing_compactions now ignores any errors returned
by tasks, and it should leave no task left behind.
Assert that here, before the compaction_manager is destroyed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them but continue with shutdown.
Leaked errors on the stop path may cause termination
on shutdown, when called in a deferred action destructor.
Fixesscylladb/scylladb#21298
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Stopped tasks currently linger in _tasks until the fiber that created
the task is scheduled again and unlinks the task. This window between
stop and remove prevents reliable checks for empty _tasks list after all
tasks are stopped.
Unlink the task early so really_do_stop() can safely check for an empty
_tasks list (next patch).
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive, this is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.
Code using _task has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.
* seastar f821bda19...fba36a3d1 (13):
> build: do not include -DBoost_TEST_DYN_LINK in seastar_testing_cflags
> doc: compatibility: update the notes on supported GCC versions
> docker: bump up to clang {18,19} and gcc {13,14}
> rpc: optimize small tuple deserialization
> rpc: switch rpc::type from boost to std
> thread: do not use fortify source
> build: suppress CMake warning about CMP0057
> core/units: remove space before literal identifier
> signal.md: describe auto signal handling
> build: persist Seastar options in SeastarConfig.cmake
> sharded.hh: seperate invoke_on decls from defs
> test: Add perf test for http client
> gate: check: mark as const
Closesscylladb/scylladb#21390
The hints and batchlog flush requests are issued to all nodes for each repair request when tombstone_gc repair mode is used.
The amount of such flush requests is high when all nodes in the cluster run repair. It is observed it takes a long time, up to 15s, for a repair request to finish such a flush request.
To reduce overhead of the flush, each node caches the flush and only executes the real flush when some time has passed. It is safe to do so before the real flush_time is returned. Repair uses the smallest flush_time from peers as the repair time.
The nice thing about the cache on the receiver side is that all senders can hit the cache. It is better than cache on the sender side.
A slightly smaller flush_time compared to the real flush time will be used with the benefits of significantly dropped hints and batchlog flush. The tradeoff is reasonable.
Fixes#20259
Performance improvement. No backports.
Closesscylladb/scylladb#20260
* github.com:scylladb/scylladb:
test/test_repair.py: Add test_batchlog_flush_in_repair
repair: Reduce hints and batchlog flush
db/batchlog_manager: Add add_delay_to_batch_replay
db/batchlog_manager: Add get_last_replay
db/batchlog_manager: wire in batchlog_replay_cleanup_after_replays
db/config: introduce batchlog_replay_cleanup_after_replays
db/batchlog_manager: do_batch_log_replay(): add cleanup flag
Optimize the various constructors a little, and add an std::from_range_t
constructor.
Minor improvement, so no backports.
Closesscylladb/scylladb#21399
* github.com:scylladb/scylladb:
utils: chunked_vector: add from_range_t constructor
utils: chunked_vector: optimize initializer_list constructor
utils: chunked_vector: iterator constructor: copy spanwise
utils: chunked_vector: reserve for forward iterators, not just random access iterators, on construction
Currently, to find the operation with given id, all operations tracked by a virtual task are listed. This isn't necessary, since we only need info regarding one particular operation.
Add a method to check whether a virtual task tracks the operation with the given id.
No backport needed
Closesscylladb/scylladb#20769
* github.com:scylladb/scylladb:
tasks: delete virtual_task::get_ids method as it is unused
tasks: improve task_manager::lookup_virtual_task
There's a whole lot of helpers and wrappers in api/ that help handlers
manipulate keyspaces and tables. One of those is foreach_column_family
which calls the provided callable on a table on each shard. There's
exactly the same (but a bit more flexible) set_table() helper nearby.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Task manager GET /status method returns two counters that reflect task progress -- total and completed. To make caller reason about their meaning, additionally there's progress_units field next to those counters.
This patch implements this progress report for backup task. The units are bytes, the total counter is total size of files that are being uploaded, and the completed counter is total amount of bytes successfully sent with PUT requests. To get the counters, the client::upload_file() is extended to calculate those.
fixes#20653Closesscylladb/scylladb#21144
* github.com:scylladb/scylladb:
backup_task: Report uploading progress
s3/client: Account upload progress for real
s3/client: Introduce upload_progress
s3: Extract client_fwd.hh
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::transform`.
in this change, we:
- replace `boost::transform` with `std::ranges::transform`
- update affected code to work with `std::ranges::transform`
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21318
This pattern is -- if requested (by test) suspend code execution until requestor (the test) explicitly wakes it up. For that the injected place should inject a lambda that is called with so called "handler" at hand and try to read message from the handler. In many cases the inner lambda additionally prints a message into logs that tests waits upon to make sure injection was stepped on. In the end of the day this "breakpoint" is injected like
```
co_await inject("foo", [] (auto& handler) {
log.info("foo waiting");
co_await handler.wait_for_message(timeout);
});
```
This PR makes breakpoints shorter and more unified, like this
```
co_await inject("foo", wait_for_message(timeout));
```
where `wait_for_message` is a wrapper structure used to pick new `inject()` overload.
Closesscylladb/scylladb#21342
* github.com:scylladb/scylladb:
sstables: Use inject(wait_for_message_overload)
treewide,error_injection: Use inject(wait_for_message) and fix tests
treewide,error_injection: Use inject(wait_for_message) overload
error_injection: Add inject() overload with wait_for_message wrapper
std::ranges::to<> has a little protocol with containers. Implement it
to get optimized construction.
Similar to the iterator pair constructor, if the range's size can be
obtained (even with an O(N) algorithm), favor that to avoid reallocations.
Copy elements spanwise to promote optimization to memcpy when possible.
Instead of copying element-by-element, copy contiguous spans. This
is much faster if the input is a span and the constructor is trivial,
since the whole thing translates to a memcpy.
Make the two branches constexpr to reduce work for the compiler in
optimizing the other branch away.
For a forward iterator, prefer a two pass algorithm to first count
the number of elements, reserver, then copy the elements, to a single
pass algorithm that involves reallocation and copying.
`Exception` could be too general, what we really care about is
`GithubException`. so let's catch the latter instead for better
readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21364
The S3 mock server (introduced in 5a96549c) currently prints its status
messages directly to stdout, which can be distracting when reviewing test
results. For example:
```console
$ ./test.py --verbose --mode debug object_store/test_backup::test_simple_backup
Found 1 tests.
Starting S3 mock server on ('127.226.51.1', 2012)
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[1/1] object_store debug [ PASS ] object_store.test_backup.1 5.99s
Stopping S3 mock server
-------------------------
CPU utilization: 6.5%
```
Move these messages to use proper logging to give developers more control
over their visibility:
- Make logger parameter mandatory in MockS3Server constructor
- Route "Stopping S3 mock server" message through the provided logger
- Add --log-level option to the standalone mock server launcher
The message is now hidden:
```console
$ ./test.py --verbose --mode debug --save-log-on-success object_store/test_backup::test_simple_backup
Found 1 tests.
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[1/1] object_store debug [ PASS ] object_store.test_backup.1 6.25s
------------------------------------------------------------------------------
CPU utilization: 5.5%
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21384
When a compaction_group is removed via `compaction_manager::remove`,
it is erase from `_compaction_state`, and therefore compaction
is definitely not enabled on it.
This triggers an internal error if tablets are cleaned up
during drop/truncate, which checks that compaction is disabled
in all compaction groups.
Note that the callers of `compaction_disabled` aren't really
interested in compaction being actively disabled on the
compaction_group, but rather if it's enabled or not.
A follow-up patch can be consider to reverse the logic
and expose `compaction_enabled` rather than `compaction_disabled`.
Fixesscylladb/scylladb#20060
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#21378
In the past, DESC SCHEMA would produce create statements for both the base
and the log table. That was incorrect as the log table is automatically
created alongside the base one. That was solved in scylladb/scylladb@9ab57b1
(scylladb/scylladb#18467).
The mentioned changes implemented the following solution:
* DESC SCHEMA/KEYSPACE/TABLE would still print a create statement for the
CDC base table,
* DESC SCHEMA/KEYSPACE would start printing an alter statement for the
CDC log table. That statement would ensure that the restored log table
has the same parameters as the original one,
* DESC TABLE <base table> would behave as DESC SCHEMA/KEYSPACE, i.e.
it would print a create statement for the base table and an alter
statement for the log table,
* DESC TABLE <log table> would result in an error.
While that solution was good and behaved correctly in the context of
restoring the schema, it had one flaw: describe statement aren't only
used as a means for producing a backup; they also serve an informative
purpose to learn about the schema, e.g. to learn what parameters a specific
table uses. Because we didn't allow for describing CDC log tables, the user
couldn't look them up directly via a describe statement -- they had to
describe the base table for that.
Attempting to describe a log table ended with an error, e.g.:
```
$ DESC TABLE ks.t_scylla_cdc_log;
ks.t_scylla_cdc_log is a cdc log table and it cannot be described directly. Try `DESC TABLE ks.t` to describe cdc base table and it's log table.
```
In these changes, we allow for describing CDC log tables again. The
semantics of the first three bullets above remains unchanged, but
we impose new behavior for DESC TABLE <log table>:
* When the user executes DESC TABLE <log table>, a create statement
will be returned, treating the table as if it were a regular one,
* The create statement will be wrapped in CQL comment markers.
The rationale for the second bullet is that although we want to give the
user a means to look into the structure and options of a CDC log table,
the returned statement is not supposed to be ever executed by them. We
want to minimize the risk of that.
An example of the behavior after the change:
```
$ DESC TABLE ks.t_scylla_cdc_log;
/* Do NOT execute this statement! It's only for informational purposes.
A CDC log table is created automatically when the base is created.
CREATE TABLE ks.t_scylla_cdc_log (
"cdc$stream_id" blob,
"cdc$time" timeuuid,
"cdc$batch_seq_no" int,
"cdc$end_of_batch" boolean,
"cdc$operation" tinyint,
"cdc$ttl" bigint,
p int,
PRIMARY KEY ("cdc$stream_id", "cdc$time", "cdc$batch_seq_no")
) WITH CLUSTERING ORDER BY ("cdc$time" ASC, "cdc$batch_seq_no" ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'enabled': 'false', 'keys': 'NONE', 'rows_per_partition': 'NONE'}
AND comment = 'CDC log for ks.t'
AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '60', 'compaction_window_unit': 'MINUTES', 'expired_sstable_check_frequency_seconds': '1800'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND speculative_retry = '99.0PERCENTILE';
*/
```
Fixesscylladb/scylladb#21235
View building is an expensive process that takes a long time to complete.
During the build, it's impact on other work should be minimized, even at
the expense of slightly slowing it down.
Instead, view building is currently performed in the the same scheduling
group (gossip) as other high-priority tasks, in particular raft processing,
which slows it down, making races more likely and increasing the number
of retries that need to be done.
While view building is still initiated in the gossip group (as it's the
result of adding a view, which is a schema change), in this patch the bulk
of the view building work is moved to a low-priority, maintenance scheduling
group (named "streaming" after its main use case).
Additionally, a test is added, where we make sure that the scheduling
group is the one most used when building a view.
Fixes https://github.com/scylladb/scylladb/issues/21232Closesscylladb/scylladb#21326
Today, each test function in test/topology_experimental_raft creates a
cluster in the beginning of the test and drops it at the end of the
function. This is very inefficient if you hope (like I do) to write many
small and pinpointed test functions instead of large test functions that
test 20 unrelated things.
Trying to propose a way to change this sad state of affairs, in
test_alternator.py I created a fixture "alternator3" which I hoped could
be used in multiple tests that need a 3-node Alternator cluster.
Currently only one test uses this fixture.
Unfortunately, it turns out the alternator3 fixture is broken, and
led to flaky test runs (sometimes the test using alternator3 picked
up an existing cluster instead of starting with an empty cluster,
and failed). These problems cannot be *completely* fixed at the current
state of the framework. The framework does not currently allow keeping
a 3-node cluster between test functions, while also allowing other test
functions to create different clusters. The specific flakiness we saw
could be fixed by adding a missing before_test() call, but in the
future we would need to ensure that all the test functions that
use it are contiguous in the test file, and I don't see how we can (or
want to) ensure this. So at this point I am giving up and withdrawing
this proposal until the developers of the topology test framework
make this one of their design goals.
Since there was only one test using this fixture, removing it should
make no performance or correctness difference - it should just fix
the flakiness.
Fixesscylladb/scylladb#21322.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21370
Fixesscylladb/scylladb#21159
When an exception is thrown in sstable write etc such that
storage_manager::isolate is initiated, we start a shutdown chain
for message service, gossip etc. These are synced (properly) in
storage_manager::stop, but if we somehow call gossiper::shutdown
outside the normal service::stop cycle, we can end up running the
method simultaneously, intertwined (missing the guard because of
the state change between check and set). We then end up co_awaiting
an invalid future (_failure_detector_loop_done) - a second wait.
Fixed by
a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added
in 20496ed, ages ago. However, it should not be needed nowadays.
b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure.
Closesscylladb/scylladb#21379
Replace use of boost::ranges::join() with another construct, as it
has no std replacement, and replace other uses with their std
equivalent, in order to reduce dependency load.
Code cleanup - no backport.
Closesscylladb/scylladb#21382
* github.com:scylladb/scylladb:
compound_compat: replace use of boost ranges with std ranges
compound_compat: simplify seriakization of ka/la sstables static cell names
these unused includes are identified by clang-include-cleaner. after auditing the source files, all of the reports have been
confirmed.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#21374
* github.com:scylladb/scylladb:
.github: add gms to iwyu's CLEANER_DIR
gms: remove unused `#include`s
The `reader_consumer_v2` type
(`std::function<future<> (mutation_reader)>`) is defined alongside
`mutation_reader` in `mutation_reader.hh`.
before this change, we sometimes use
`std::function<future<> (mutation_reader)>` directly when defining a
consumer parameter or a consumer variable.
in this change, we improve maintainability by:
- Reducing duplicate function type declarations
- Centralizing the consumer type definition
- Making future signature updates easier to implement
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21369
To reduce the dependency load, replace use of boost ranges
with the std equivalent.
Files that lost the indirect boost dependency have it added as a
direct dependency.
compound_compat is used for serializing ka/la sstables static cell names.
Since we can no longer write such sstabkes, the function is used only
in some tests.
Reduce the use of boost::range::join(): it has no direct equivalent
in std (std::views::concat is in C++26), and it is slow due to the
need to type-erase. Instead of using boost::range::join, extend the
vector used to hold the empty clustering key a bit more, and copy
the view representing the static cell name into into it.
these unused includes are identified by clang-include-cleaner.
after auditing the source files, all of the reports have been
confirmed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This place could be in the pre-previous patch, it just can use the
overload, but it seemengly has a bug. It prints _two_ messages -- that
the injection handler was suspended and that it was woken up. The bug is
in the 2nd message -- it's printed without waiting for the message, so
it likely gets printed before wakeup itself. It seems that no tests care
about it though.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is continuation of previous patch, this time also update tests that
wait for specific message in logs (to make sure injection handler was
called and paused the code execution).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Many places want to inject a handler that waits for external kick. Now
there's convenience inject() method overload for this. It will result in
extra messages in logs, but so far no code/test cares about it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The wrapper object denotes that injection should run a handler and
wait_for_message() on it. Wrapper carries the timeout used to call the
mentioned method. It's currently unused, next patches will start enjoing
it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Cassandra 4.1 announced a new option to create a role with:
`HASHED PASSWORD`. Example:
```
CREATE ROLE bob WITH HASHED PASSWORD = 'hashed_password';
```
We've already introduced another option following the same
semantics: `SALTED HASH`; example:
```
CREATE ROLE bob WITH SALTED HASH = 'salted_hash';
```
The change hasn't made it to any release yet, so in this commit
we rename it to `HASHED PASSWORD` to be compatible with Cassandra.
Additionally, we adjust existing tests to work against Cassandra too.
Fixesscylladb/scylladb#21350Closesscylladb/scylladb#21352
Currently, lookup_virtual_task gets the list of ids of all operations
tracked by a virtual task and checks whether it contains given id.
The list of all ids isn't required and the check whether one particular
operation id is tracked by the virtual task may be quicker than listing
all operations.
Add virtual_task::contains method and use it in lookup_virtual_task.
Add documentation to clarify the purpose and behavior of
make_interpose_consumer() in the compaction_strategy_impl class. This
method is crucial for building layered processing pipelines but its
semantics were previously undocumented.
The added documentation explains how:
- It decorates end consumers with additional processing steps
- It enables construction of processing pipelines
- The original consumer's semantics are preserved
This improves code maintainability by making the pipeline construction
pattern more apparent to developers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21336
Continue standardization on std::ranges. Since compound contains a custom
iterator, we first have to upgrade it to C++20 iterator concepts.
Cleanup / minor refactoring, so no backport.
Closesscylladb/scylladb#21320
* github.com:scylladb/scylladb:
compound: replace boost ranges with std ranges
compound: upgrade iterator to be an std::forward_iterator
The hints and batchlog flush requests are issued to all nodes for each
repair request when tombstone_gc repair mode is used.
The amount of such flush requests is high when all nodes in the cluster
run repair. It is observed it takes a long time, up to 15s, for a repair
request to finish such a flush request.
To reduce overhead of the flush, each node caches the flush and only
executes the real flush when the cahce time has passed. It is safe to do
so because the real flush_time is returned. Repair uses the smallest
flush_time returned from peers as the repair time.
The nice thing about the cache on the receiver side is that all senders
can hit the cache. It is better than cache on the sender side.
A slightly smaller flush_time compared to the real flush time will be
used with the benefits of significantly dropped hints and batchlog
flush. The trade-off looks reasonable.
Tests: 2 nodes, with 1s batchlog delay:
Before:
Repair nr_repairs=20 cache_time_in_ms=0 total_repair_duration=40.04245328903198
After:
Repair nr_repairs=20 cache_time_in_ms=5000 total_repair_duration=1.252073049545288
Fixes#20259
After the specified amount of replays, trigger a cleanup: flush batchlog
table memtables. This allows the cleanup to happen on a configurable
interval, instead of on every batchlog replay attempt, which might be
too much.
Add a flag controlling whether cleanup (memtable flush) will be done
after the replay. This is to allow repair to opt out from cleanup --
when many concurrenty repairs are running, there can be storms of calles
to do_batch_log_replay(), which will be mostly no-op, but they will all
attempt to flush the memtable to clean-up after themselves. This is
unnecessary and introduces latency to repairs, best to leave the cleanup
to the periodic batch-log replay.
This reverts commit c286434e4c, reversing
changes made to 6712fcc316.
The commit causes memtable_test to be very flaky in debug mode.
Specifically, subtests test_exceptions_in_flush_on_sstable_open
and test_exceptions_in_flush_on_sstable_write).
The @classmethod/@property combination was deprecated in Python 3.11
and removed[1] in Python 3.13. It's used in scylla-gdb.py, breaking it
with Python 3.13.
To fix, just make all users (size_t and _vptr_type) top-level
functions. The definitions are all identical and don't need to be
in class scope.
[1] https://docs.python.org/3.13/library/functions.html#classmethodClosesscylladb/scylladb#21349
Current code takes a reference and holds it past preemption points. And
while the state itself is not suppose to change the reference may
become stale because the state is re-created on each raft topology
command.
Fix it by taking a copy instead. This is a slow path anyway.
Fixes: scylladb/scylladb#21220Closesscylladb/scylladb#21316
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.
I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO (relies on https://github.com/scylladb/scylladb/pull/20522).
But there is IO during promoted index search (via cached_file).
Before the patch this workload was doing 4k req/s, after the patch it does 30k req/s.
The problem is much less pronounced if there is data file or partition index IO involved
because that IO will signal read concurrency semaphore to invite more concurrency.
Fixes#21325Closesscylladb/scylladb#21323
* github.com:scylladb/scylladb:
utils: cached_file: Mark permit as awaiting on page miss
utils: cached_file: Push resource_unit management down to cached_file
Issue scylladb/scylladb#21114 reported that sometimes during the test we
timeout when waiting for node to restart after it was killed.
Preliminary investigation showed that the node appears to be hanging
inside `topology_state_load`, while holding `token_metadata` lock, which
prevents `join_topology` from progressing.
Enable TRACE level logging for `raft_topology` so we get more accurate
info where inside `topology_state_load` the hang happens, once the
problem reproduces again in CI.
Closesscylladb/scylladb#21247
before this change, these
[convenience libraries](https://www.gnu.org/software/automake/manual/html_node/Libtool-Convenience-Libraries.html)
were implicitly built as static libraries by default,
but weren't explicitly marked as STATIC in CMake. While this worked
with default settings, it could cause issues if `BUILD_SHARED_LIBS` is
enabled.
So before we are ready for building these components as shared
libraries, let's mark all convenience libraries as STATIC for
consistency and to prevent potential issues before we properly support
shared library builds.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21274
Adding an auto-backport.py script to handle backport automation instead of Mergify.
The rules of backport are as follows:
* Merged or Closed PRs with any backport/x.y label (one or more) and promoted-to-master label
* Backport PR will be automatically assigned to the original PR author
* In case of conflicts the backport PR will be open in the original autoor fork in draft mode. This will give the PR owner the option to resolve conflicts and push those changes to the PR branch (Today in Scylla when we have conflicts, the developers are forced to open another PR and manually close the backport PR opened by Mergify)
* Fixing cherry-pick the wrong commit SHA. With the new script, we always take the SHA from the stable branch
* Support backport for enterprise releases (from Enterprise branch)
Fixes: https://github.com/scylladb/scylladb/issues/18973Closesscylladb/scylladb#21302
Do it by passing reference to s3::upload_progress_monitor object that
sits on task impl itself. Different files' uploads would then update the
monitor with their sizes and uploaded counters. The structure is
reported by get_progress() method. Unit size is set to be bytes. Test is
updated.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before upload starts file size is checked, so this is the place that
updates progress.total counter.
Uploading a file happens by reading unit_size bytes from file input
stream and writing the buffer into http body writer stream. This is the
place to update progress.uploaded counter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is a structure with "total" and "uploaded" counters that's passed
by user to client::upload_file() method so that client would update it
with the progress.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to export some simple structures to users without the need to
include client.hh itself (rather large already)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Its handler dereferences long chain of objects to get to the value it needs. There's shorter way.
Also, the endpoint in question is not unregistered on stop.
Closesscylladb/scylladb#21279
* github.com:scylladb/scylladb:
api: Make get_highest_supported_sstable_version use proper service
api: Move system::get_highest_supported_sstable_version set/unset
api: Scaffold for sstables-format-selector
Most of inject() overloads check if the injection is enabled, then
optionally clear the one-shot one, then do the injection. Everything
but doing the injection is implemented in the enter() method, it's
perfectly worth re-using one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21285
Previously in e65185ba, when merging Seastar's and ScyllaDB's compilation
databases, the "prefix" parameter in merge-compdb.py was too restrictive.
It only included build rules for files with "CMakeFiles" prefix, excluding
source files in subdirectories like `apps/iotune/CMakeFiles/app_iotune.dir/iotune.cc.o`.
In this change, we change the prefix parameter to an empty string to
include all source files whose object files are located under build directories, regardless of their path structure.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21312
Separate the configuration for enabling the tablets feature from the enablement of tablets when creating new keyspaces.
This change always enables the TABLETS cluster feature and the tablets logic respectively.
The `enable_tablets` config option just controls whether tablets are enabled or disabled by default for new keyspaces.
If `enable_tablets` is set to `true`, tablets can be disabled using `CREATE KEYSPACE WITH tablets = { 'enabled': false }` as it is today.
If `enable_tablets` is set to `false`, tablets can be enabled using `CREATE KEYSPACE WITH tablets = { 'enabled': true }`.
The motivation for this change is to simplify the user experience of using tablets by setting the default for new keyspaces to false amd allowing the user to simply opt-in by using tablets = {enabled: true }.
This is not pissible today.
The user has to enable tablets by default for all new keyspaces (that use the NetworkTopologyStrategy) and then actively opt-out to use vnodes.
* Not required to be backported to OSS versions. May be backported to specific enterprise versions
Closesscylladb/scylladb#20729
* github.com:scylladb/scylladb:
data_dictionary: keyspace_metadata::describe: print tablets enabled also when defaulted
tablets_test: test enable/disable tablets when creating a new keyspace
treewide: always allow tablets keyspaces
feature_service: prevent enabling both tablets and gossip topology changes
alternator: create_keyspace_metadata: enable tablets using feature_service
This patch adds the option "--release <version>" to test/cql-pytest/run,
which downloads the pre-compiled Scylla release with the given version
number and runs the tests against that version. For example, it can be used
to demonstrate that #15559 was indeed a regression between 2022.1 and 2022.2,
by running a recently-added test against these two old versions:
test/cql-pytest/run --release 2022.1 --runxfail \
test_prepare.py::test_duplicate_named_bind_marker_prepared
test/cql-pytest/run --release 2022.2 --runxfail \
test_prepare.py::test_duplicate_named_bind_marker_prepared
The first run passes, the second fails - showing the regression.
The Scylla releases are downloaded from ScyllaDB's S3 bucket
(downloads.scylladb.com). They are saved in the build/ directory
(e.g., build/2022.2.9), and if that directory is not removed, when
"run --release" requests the same version again, the previous download
is reused.
Release numbers can look like:
* 5.4.7
* 5.4 (will get the latest in the 5.4 branch, e.g., 5.4.7)
* 5.4.0~rc2 (a prerelease)
* 2021.1.9 (Enterprise release)
* 2023.1 (latest in this branch, Enterprise release)
Fixes#13189
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19228
before this change, Seastar enables CXX_EXTENSIONS in its own
build rules. but it does not expose it to the parent project. but
scylladb's CMake building system respect seastar's .pc file and
includes the cflags exposed by it. without this change, scylladb
included "-std=c++23" from seastar, and "-std=gnu++23" from itself.
this is both confusing and inconsistent with the build rules generated
by `configure.py`.
in this change, we explicitly set `CMAKE_CXX_EXTENSIONS` when creating
Seastar's building rules, so that it can populate this setting to its
.pc file. in this way, we don't have two different options for
specifying the C++ standard when building scylladb with CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21311
The workaround was initially added to silence warnings on GCC < 6.4 for ARM
platforms due to a compiler bug (gcc.gnu.org/bugzilla/show_bug.cgi?id=77728).
Since our codebase now requires modern GCC versions for coroutine support,
and the bug was fixed in GCC 6.4+, this workaround is no longer needed.
Refs 193d1942f2
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21308
Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to increase selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.
There is already infrastructure to lookup end position based on upper
bound of the read, in anticipation for sharing the promoted index cache,
but it's not effective becasue it's a non-populating lookup and the upper
bound cursor has its own private cached_promoted_index, which is cold
when positions are computed. It's non-populating on purpose, to avoid
extra index file IO to read upper bound. In case upper bound is far-enough
from the lower bound, this will only increase the cost of the read.
The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.
We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately. This is especially important
for single-row reads, where the bounds are around the same key. In
this case we want to read the data file range which belongs to a
single promoted index block. It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache. Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.
Fixes#10030.
The change was tested with perf-fast-forward.
I populated the data set with `column_index_size_in_kb` set to 1
scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1
Test run:
build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0
This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit.
Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.
Also, the first read which misses in cache is still faster after the change.
Before:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4%
500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0%
```
After:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1%
500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1%
```
Backports: none, not a regression
Closesscylladb/scylladb#20522
* github.com:scylladb/scylladb:
perf: perf_fast_forward: Add test case for querying missing rows
perf-fast-forward: Allow overriding promoted index block size
perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows
perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows
perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
sstables: bsearch_clustered_cursor: Add more tracing points
sstables: reader: Log data file range
sstables: bsearch_clustered_cursor: Unify skip_info logging
sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
sstables: bsearch_clustered_cursor: Skip even to the first block
test: sstables: sstable_3_x_test: Improve failure message
sstables: mx: writer: Never include partition_end marker in promoted index block width
sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
sstables: clustered_cursor: Track current block
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.
I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO. But there is IO during
promoted index search (via cached_file). Before the patch this
workload was doing 4k req/s, after the patch it does 30k req/s.
The problem is much less pronounced if there is data file or index
file IO involved because that IO will signal read concurrency
semaphore to invite more concurrency.
Standardize on the standard range library.
The serialize_value(initializer_list) overload is disambiguated
not to call itself. Apparently it wasn't called before.
Since std::ranges::subrange does not provide operator==, replace
it with std::ranges::equals().
compound::iterator isn't far from a forward_iterator, and if we want
to use it with std::ranges, we have to upgrade it. This is because
std::ranges::subrange() only provides front() for forward ranges, and
we do use this front(). Boost apparently isn't as strict.
To make it a forward_range, we have to drop operator-> and make
operator* return a value (similar to std::views::tranform), since
forward iterators require that pointers and references be stable,
and this iterator returns a pointer to one of its members.
We also add an iterator_concept member to declare the compatibility
to std::ranges.
In the current scenario, the nodetool status doesn’t display information regarding zero token nodes. For example, if 5 nodes are spun by the administrator, out of which, 2 nodes are zero token nodes, then nodetool status only shows information regarding the 3 non-zero token nodes.
This commit intends to fix this issue by leveraging the “/storage_service/host_id ” API and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.
A test is also added in nodetool/test_status.py to verify this logic. This test fails without this commit’s zero token node support logic, hence verifying the behavior.
This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions don't support zero token nodes.
Fixes: scylladb/scylladb#19849Fixes: scylladb/scylladb#17857Closesscylladb/scylladb#20909
* github.com:scylladb/scylladb:
fix nodetool status to show zero-token nodes
test: move `wait_for_first_completed` to pylib/util.py
token_metadata: rename endpoint_to_host_id_map getter and add support for joining nodes
before this change, we only record the exception returned
by `upload_file()`, and rethrow the exception. but the exception
thrown by `update_file()` not populated to its caller. instead, the
exceptional future is ignored on pupose -- we need to perform
the uploads in parallel. this is why the task is not marked fail
even if some of the uploads performed by it fail.
in this change, we
- coroutinize `backup_task_impl::do_backup()`. strictly speaking,
this is not necessary to populate the exception. but, in order
to ensure that the possible exception is captured before the
gate is closed, and to reduce the intentation, the teardown
steps are performed explicitly.
- in addition to note down the exception in the logging message,
we also store it in a local variable, which it rethrown
before this function returns.
Fixesscylladb/scylladb#21248
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21254
in 8d1b3223, we removed some unused "#include"s, but we failed to
address all of them in "dht" subdirectory. and the unaddressed
"#include"s are identified by the iwyu workflow.
in this change, we address the leftovers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21291
This commit improves the README file so that it's more helpful
to documentation contributors. Especially, it:
- Adds the link to the prerequisites.
- Add information on troubleshooting (checking the links, headings, etc.)
- Removes the section on creating a knowledge base article, as we no longer
promote adding KBs in favor of creating a coherent documentation set.
Fixes https://github.com/scylladb/scylladb/issues/21257Closesscylladb/scylladb#21262
This commit updates the configuration for ScyllaDB documentation so that:
- 6.2 is the latest version.
- 6.2 is removed from the list of unstable versions.
It must be merged when ScyllaDB 6.2 is released.
In addition, this commit uncomments the redirections that should be applied
when version 6.2 is the latest stable version (which will happen when this commit
is merged).
No backport is required.
Closesscylladb/scylladb#21133
This endpoint now grabs one via database -> table -> sstables manager
chain, but there's shorter route, namely via sstables format selector.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's currently registered with all other system endpoints and is not
unregistered. Its correct place is in the sstables-format-selector
set/unset functions.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fix how regular tasks that have a virtual parent are created
in task_manager::module::make_task: set sequence number
of a task and subscribe to module's abort source.
Fixes: #21278.
Needs backport to 6.2
Closesscylladb/scylladb#21280
* github.com:scylladb/scylladb:
tasks: fix sequence number assignment
tasks: fix abort source subscription of virtual task's child
Currently, test_repair_succeeds_with_unitialized_bm checks whether
repair finishes successfully and the error is properly handled
if batchlog_manager isn't initialized. Error handling depends on
logs, making the test fragile to external conditions and flaky.
Drop the error handling check, successful repair is a sufficient
passing condition.
Fixes: #21167.
Closesscylladb/scylladb#21208
Commit 3a12ad96c7
added an sstable_identifier uuid to the SSTable
scylla_metadata component, however it was
under-documented and this patch adds the missing
documentation for the sstable component format,
and to the scylla sstable tool documentation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#21221
in cc3953e5, we disabled Seastar exception hack in configure.py.
this change disabled the Seastar exception hack in the following
two builds:
- build generated directly by configure.py
- build configured with multi-config generator using CMake
but we also have non-multi-config build using CMake. to be more
consistent, let's apply the equivalent change to non-multi-config
build of CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21233
The skipped ranges should be multiplied by the number of tables
Otherwise the finished ranges ratio will not reach 100%.
Fixes#21174Closesscylladb/scylladb#21252
* github.com:scylladb/scylladb:
test: Add test_node_ops_metrics.py
repair: Make the ranges more consistent in the log
repair: Fix finished ranges metrics for removenode
Since commit 415c83fa, Seastar is built as an external project. As a result,
the compile_commands.json file generated by ScyllaDB's CMake build system no
longer contains compilation rules for Seastar's object files. This limitation
prevents tools from performing static analysis using the complete dependency
tree of translation units.
This change merges Seastar's compilation database with ScyllaDB's and places
the combined database in the source root directory, maintaining backward
compatibility.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21234
This collector reads nvme temperature sensor, which was observed to
cause bad performance on Azure cloud following the reading of the
sensor for ~6 seconds. During the event, we can see elevated system
time (up to 30%) and softirq time. CPU utilization is high, with
nvm_queue_rq taking several orders of magnitude more time than
normally. There are signs of contention, we can see
__pv_queued_spin_lock_slowpath in the perf profile, called. This
manifests as latency spikes and potentially also throughput drop due
to reduced CPU capacity.
By default, the monitoring stack queries it once every 60s.
Closesscylladb/scylladb#21165
before this change, the "dist" targets are always enabled in the
CMake-based building system. but the build rules generated by
`configure.py` does respect `--enable-dist` and `--disable-dist`
command line options, and enable/distable the dist targets
respectively.
in this change, we
- add an CMake option named "Scylla_DIST". the "dist"
subdirectory in CMake only if this option is ON.
- pouplate the `--enable-dist` and `--disable-dist` option
down to cmake by setting the `Scylla_DIST` option,
when creating the build system using CMake.
this enables the CMake-based build system to be functionality
wise more closer to the legacy building system.
Refs scylladb/scylladb#2717
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21253
now that we are allowed to use C++23. we now have the luxury of using
`std::views::values`.
in this change, we:
- replace `boost::adaptors::map_values` with `std::views::values`
- update affected code to work with `std::views::values`
- the places where we use `boost::join()` are not changed, because
we cannot use `std::views::concat` yet. this helper is only
available in C++26.
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21265
Currently, running the `nodetool compactionhistory` command or using the rest api `curl -X GET --header "Accept: application/json" "http://localhost:10000/compaction_manager/compaction_history"` return compaction history without the `row_merged` field.
The series computes rows merged during compaction and provides this information to users via both the nodetool command and the rest api. The `rows_merged` field contains information on merged clustering keys across multiple sstable files. For instance, compacting two sstables of a table consisting of 7 rows where two rows are part of the both sstables, the output would have the following format: {1: 5, 2: 2}.
No backport is required. It extends the existing compaction history output.
Fixes https://github.com/scylladb/scylladb/issues/666Closesscylladb/scylladb#20481
* github.com:scylladb/scylladb:
test/rest_api: Add tests for compactionhistory
nodetool: Add rows merged stats into compactionhistory output
compaction: Update compaction history with collected histogram
compaction: Remove const qualifier from methods creating sstable readers
sstable_set: Add optional statistics to make_local_shard_sstable_reader
make_combined_reader: Add optional parameter, combined_reader_statistics
reader_selector: Extend with maximum reader count
mutation_fragment_merger: Create histogram while consuming mutation fragment batches
Since `with_deserialized()` returns the lambda function's result, we can
directly return the bucket from within the lambda instead of relying on
side effects. This makes the code more explicit and functional.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21273
A worry was raised that an unprivileged user might be able to read the
system.roles table - which contains the Alternator secret keys (and also
CQL's hashed passwords). This patch adds tests that show that this worry
is unjustified - and acts as a regression test to ensure it never
becomes justified. The tests show that an unprivileged user cannot read
the system.roles table using either CQL or Alternator APIs.
More specifically, the two tests in this patch demonstrate that:
* The Alternator API does not allow an unprivileged user to read ANY system
table, unless explicitly granted permissions for that table.
* The CQL API whitelists (see service::client_state::has_access) specific
system tables - e.g., system_schema.tables - that are made readable to any
unprivileged user. But the system.auth table is NOT whitelisted in this
way - and is unreadable to unprivileged users unless explicitly granted
permissions on that table.
The new tests passes on both Scylla and Casssandra.
Refs #5206 (that issue is about removing the Alternator secret keys from
the roles table - but stealing CQL salted hashes is still pretty bad, so
it's good to know that unprivileged users can't read them).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21215
While documenting materialized view in a new document (Refs #16569)
I encountered a few questions on how various CQL operations work on
a table that has views, and this patch contains tests that clarify their
answer - and can later guarantee that the answer doesn't unintentionally
change in the future. The questions that these tests answer are:
1. That TRUNCATE on a base table also TRUNCATEs its views. This is just
a basic test, with no attempt to reproduce issue #17635 (which is
about the truncation of the base and views not being atomic).
2. That DROP TABLE is *not allowed* on a base table that has views.
3. That DROP KEYSPACE is allowed, even if there are tables with views.
4. Test that ALTER TABLE tbl DROP is never allowed in Cassandra, but
allowed in some cases by Scylla
5. Test that ALTER TABLE tbl ADD is allowed, and "SELECT *" expands to
select the new column into the materialized view as well.
All the new tests pass on both Scylla and Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21142
The stream-session is the receiving end of streaming, it reads the
mutation fragment stream from an RPC stream and writes it onto the disk.
As such, this part does no disk IO and therefore, using a permit with
count resources is superfluous. Furthermore, after
d98708013c, the count resources on this
permit can cause a deadlock on the receiver end, via the
`db::view::check_view_update_path()`, which wants to read the content of
a system table and therefore has to obtain a permit of its own.
Switch to a tracking-only permit, primarily to resolve the deadlock, but
also because admission is not necessary for a read which does no IO.
Refs: scylladb/scylladb#20885 (partial fix, solves only one of the deadlocks)
Fixes: scylladb/scylladb#21264Closesscylladb/scylladb#21059
There's a `missing_summary_first_last_sane` test case that uses some very specific way of modifying an sstable -- it loads one from resources, then tries to "write" the loaded stuff elsewhere. For that it uses a special purpose test::store() helper and a bunch of auxiliary ones from the same class. Those aux helpers are not used anywhere else and are also very special for this test case, so it make sense to keep this whole functionality in a single helper.
Closesscylladb/scylladb#21255
* github.com:scylladb/scylladb:
test: Squash test::change_generation_number() into test::store()
test: Squash test::change_dir() into test::store()
test: Coroutinize sstables::test::store()
This commit adds a new test case 'test_group_by_static_column_and_tombstones'
to verify the behavior of GROUP BY queries with static columns. The test is
adapted from Cassandra's test suite and aims to reproduce issue #21267.
Original, larger test:
cassandra_tests/validation/operations/select_group_by_test.py::testGroupByWithPaging()
Closesscylladb/scylladb#21270
Currently, if a regular task does not have a parent or its parent
is a virtual tasks then it subscribes to module's abort source
in task_manager::task::impl constructor. However, at this point
the kind of the task's parent isn't set. Due to that, children
of virtual tasks aren't aborted on shutdown.
Subscribe to module's abort source in task::impl::set_virtual_parent.
In the current scenario, the nodetool status doesn’t display information
regarding zero token nodes. For example, if 5 nodes are spun by the
administrator, out of which, 2 nodes are zero token nodes, then nodetool
status only shows information regarding the 3 non-zero token nodes.
This commit intends to fix this issue by leveraging the “/storage_service/host_id
” API and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.
Robust topology tests are added, which spins up scylla nodes and confirm nodetool
status output for various cases, providing good coverage.
A test is also added in nodetool/test_status.py to verify this logic. These tests fail
without this commit’s zero token node support logic, hence verifying the behavior.
The test `test_status_keyspace_joining_node` has been removed. This test is
based on case where host_id=None, which is impossible. Since we now use
host_id_map for node discovery in nodetool, the nodes with "host_id=None"
go undetected. Since this case is anyway impossible, we can get rid of this.
This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions dont support zero token nodes.
Fixes: scylladb/scylladb#19849
Rename host_id map getter, 'get_endpoint_to_host_id_map_for_reading' to 'get_endpoint_to_host_id_map_'
Also modify the getter to return information regarding joining nodes as well.
This getter will later be used for retrieving the nodes in nodetool status, hence it needs to show all nodes,
including joining ones.
The function name suffix `_for_reading` suggests that the function was used
in some other places in the past, and indeed if we need endpoints
"for reading" then we cannot show joining endpoints. But it was confirmed
that this function is currently only used by "/storage_service/host_id" endpoint,
hence it can be modified as required.
Fixes: scylladb/scylladb#17857
these unused includes are identified by clang-include-cleaner. after auditing the source files, all of the reports have been
confirmed.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#21237
* github.com:scylladb/scylladb:
.github: add dht to iwyu's CLEANER_DIR
dht: remove unused `#include`s
No other usages of the former helper other than immediatelly followed by
the latter, no point in keepint it around.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No other usages of the former helper other than immediatelly followed by
the latter, no point in keepint it around.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that tablets may be explicitly enabled when
creating a new keyspace, describe tablets as enabled
even when the default initial_tablets==0 is used.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test both configuration values for `enable_tablets`
and the possibility to explicitly enable or disable
tablets, respectively, when creating a keyspace using the
`tablets = {'enabled': true|false}` CREATE KEYSPACE option.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With the tablets feature always enabled (Unless gossip toopology
changes are forced), the enable_tablets option now controls only
the default for newly created keyspaces.
Even when set to `false`, tablets are still enabled as a
feature and the user may explicitly enable tablets
using `CREATE KEYSPACE <name> WITH tablets = {'enabled': true}`
Note: best viewed with `git show -w`
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Tablets require raft consistent topology changes.
Therefore, document that they are incompatible in
the config help and prevent their usage in
`feature_config_from_db_config`
Fixesscylladb/scylladb#21075
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
in 6ead5a46, we included submodule changes in cqlsh and java by accident.
this was not intended. and this broke the artifacts-rocky8-test.
in this change, both changes in the submodule are reverted.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21236
this series:
- promote object storage configuration to user-facing documentation
- reference object storage config doc from nodetool commands
---
the nodetool backup/restore commands are not included by any LTS branches yet, hence no need to backport.
Closesscylladb/scylladb#21071
* github.com:scylladb/scylladb:
docs: move keyspace-storage-option from cql-extensions to admin
docs: reference admin.rst for object storage config
docs: reference object storage config doc from nodetool commands
docs: promote object storage configuration to user-facing documentation
Despite OSS doesn't limit number of created service levels, match the
enterprise limit to decrease divergence in the test between OSS and
enterprise.
Fixesscylladb/scylladb#21044Closesscylladb/scylladb#21045
these unused includes are identified by clang-include-cleaner.
after auditing the source files, all of the reports have been
confirmed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We introduce an auxiliary type representing a service level for making
it easier to adjust the tests in Enterprise. We move the responsibility
of producing create statements for service levels to the class, so we
only need to modify the code in one place when necessary.
All existing relevant tests have been adjusted to this change.
Closesscylladb/scylladb#21230
ALTER tablets-enabled KEYSPACES (KS) may fail due to
`group0_concurrent_modification`, in which case it's repeated by a `for`
loop surrounding the code. But because raft's `add_entry` consumes the
raft's guard (by `std::move`'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned `for` loop altogether and rethrow the exception, as the `rf_change` event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
Note: refactor is implemented in the follow-up commit.
Fixes: scylladb/scylladb#21102
Should be backported to every 6.x branch, as it may lead to a crash.
Closesscylladb/scylladb#21121
* github.com:scylladb/scylladb:
test: add UT to test retrying ALTER tablets KEYSPACE
cql/tablets: fix indentation in `rf_change` event handler
cql/tablets: fix retrying ALTER tablets KEYSPACE
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.
Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.
Fixes#20916
`perf-simple-query` stats before and after this fix :
`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59393 insns/op, 24029 cycles/op, 0 errors)
97551.14 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59376 insns/op, 23966 cycles/op, 0 errors)
96599.92 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59367 insns/op, 23998 cycles/op, 0 errors)
97774.91 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59370 insns/op, 23968 cycles/op, 0 errors)
97796.13 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59368 insns/op, 23947 cycles/op, 0 errors)
throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19
// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59392 insns/op, 24058 cycles/op, 0 errors)
97311.48 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59375 insns/op, 24005 cycles/op, 0 errors)
98043.10 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59381 insns/op, 23941 cycles/op, 0 errors)
96750.31 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59396 insns/op, 24025 cycles/op, 0 errors)
93381.21 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59390 insns/op, 24097 cycles/op, 0 errors)
throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```
This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.
Closesscylladb/scylladb#20985
* github.com:scylladb/scylladb:
topology-custom: add test to verify tombstone gc in read path
replica/table: check memtable before discarding tombstone during read
compaction_group: track maximum timestamp across all sstables
File uploading code spawns all parts uploading into background. If this "spawning" fails (not the uploading code itself), any fiber that was spawned before is orphaned. It will eventually stop on its own, by while it's alive it may use(-after-free) the do_upload_file object.
Another issue with not handling spawn exception, is that multipart upload object is not aborted in this case. So it's leaked until garbage collector picks it up, which is not critical, but unpleasant.
Closesscylladb/scylladb#21139
* github.com:scylladb/scylladb:
s3/client: Restore indentation after previous patch
s3/client: Catch do_upload_file::upload_part() exceptions
* Add `--max-failures` flag to test.py, which will stop the execution after number of failures
* Helps with "fails-fast" approach and can be used to improve CI speed, especially the 100times run
* Adds the number of cancelled tests to both summary and junit xml. I did not include them in boost, since it does not contain any statistics.
* Removes unnecessary list creation in test.py
* Completely unrelated change, but it is small enough that I feel it can be included as part of this one. If this is an issue I can create separate PR for it
* Add `Test.started` property
* Helps with determining the current status of the Test and differentiating cancelled/not started tests.
* Add `Test.failed` and `Test.did_not_run` read-only computed properties
* Helper methods to determine status, instead of using `Test.success`, which does not tell the entire story
* Fix `ScyllaClusterManager.stop()` method, so it doesn't fail when ran multiple times
* This happens when tasks are cancelled, not sure yet why, it almost certainly non-wanted behaviour but this behaviour was already there and with this fix it no longer causes errors
I will use backport/None for now as it is a new feature.
Fixes https://github.com/scylladb/qa-tasks/issues/1714Closesscylladb/scylladb#21098
* github.com:scylladb/scylladb:
test.py: Add option to fail after number of failures
test.py: Add started, failed and did_not_run properties to Test
test.py: Remove unnecessary list creation
test: lib: Fix ScyllaClusterManager.stop()
`mutation_reader::is_end_of_stream()` returns
`_impl->is_end_of_stream() && is_buffer_empty()`, so
`!is_end_of_stream()` equals to `
`!_impl->is_end_of_stream() || !is_buffer_empty()`, which in turn
always equals to
`!_impl->is_end_of_stream() || !is_buffer_empty() || !is_buffer_empty()`.
hence there is no need to check `rd.is_buffer_empty()` again.
in this change, the redundant condition is dropped. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21224
When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload.
This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency.
The second issue fixed by this patch is the concurrency of the view updates that is currently unlimited. Because of that view updates may take up so much memory that they we may run out of memory.
This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue.
The new semaphore has half the capacity of the regular user read concurrency semahpore and is currently used only for user writes - is't used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction.
This patch also adds a test to confirm that the view update workload doesn't impact the read latency, as well as a test which confirms that we do not run out of memory even under heavy view udpate workload.
The issue of view updates causing increased latencies most often occurs in the following scenario:
* we have a medium to high write workload to a table with a materialized view which requires reading from the base table before sending the update to delete the old rows
* we have any read workload
* one replica is slower or is handling more writes due to an imbalance of data distribution
* we write with a cl<ALL, the mentioned replica is replying to write requests slower while new ones keep being sent to it.
* each write performs a read first taking resources from the user read concurrency semaphore, so when enough writes accumulate the reads using the semaphore start getting queued
* the queue is shared by regular reads and view update reads. When there's enough view update reads in the queue, regular reads start getting increased latencies
An sct test (perf-regression-latency-mv-read-concurrency) was prepared to somewhat resemble this scenario:
* the tables were prepared satisfying the conditions above
* we use a medium write workload and a very low read workload
* the imbalance is achieved by writing to just a few (10) partitions - some replicas (and shards) can have twice or more used partitions than others. We also keep writing to a limited (though high) number of rows, to cause overwrites which require reading before sending the view update
* to minimize the test case, we use a cluster of 3 nodes and rf=2, we write with cl=ONE to have background replica writes and read with cl=ALL to wait for the slower replica to respond.
In the test above:
* without the fix, the latency of reads increases over 50s
* with the fix, the latency of reads stays below 20ms
Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805
The patch is not that small and it isn't fixing a regression, so no backports
Closesscylladb/scylladb#20887
* github.com:scylladb/scylladb:
test: add test for high view update concurrency causing bad_allocs
test: add test for high view update concurrency degrading read latency
mv: add a dedicated read concurrency semaphore for view update read before writes
Before python 3.12 formatted strings couldn't have reused quotes.
Change the type of quotation mark in get_cgroup so it could be
used with earlier python versions.
Closesscylladb/scylladb#21209
The newly added testcase is based on the already existing
`test_alter_dropped_tablets_keyspace`.
A new error injection is created, which stops the ALTER execution just
before the changes are submitted to RAFT. In the meantime, a new schema
change is performed using the 2nd node in the cluster, thus causing the
1st node to retry the ALTER statement.
ALTER tablets-enabled KEYSPACES (KS) may fail due to
`group0_concurrent_modification`, in which case it's repeated by a `for`
loop surrounding the code. But because raft's `add_entry` consumes the
raft's guard (by `std::move`'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned `for` loop altogether and rethrow the exception, as the `rf_change` event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
`topology_coordinator::handle_topology_coordinator_error` handling the
case of `group0_concurrent_modification` has been extended with logging
in order not to write catch-log-throw boilerplate.
Note: refactor is implemented in the follow-up commit.
Fixes: scylladb/scylladb#21102
The all_datadirs keeps paths to directories where local sstables can be. In fact, Scylla doesn't put sstables there, but can try to find them on boot and when checking snapshots. The 0th element of this vector, called datadir, had recently been removed by #20675, now it's time to drop all_datadirs as well. The needed paths can be obtained from table's storage options (see #20542) and db::config::data_file_directories option.
Closesscylladb/scylladb#21212
* github.com:scylladb/scylladb:
sstables: Open-code format_table_directory_name() moved recently
replica,sstables: Move format_table_directory_name()
table: Remove all_datadirs
sstables: Generate table::all_datadirs from db::config and storage_options
replica: Prepare vector of fs::path-s with table dirs
table: Check storage options in get_snapshot_details()
Rewrite
_begin + n > _capacity_end
as
n > _capacity_end - _begin
and then as
n > capacity()
for two reasons:
- The last form is easier to read than the first form.
- Per N4950 (the final C++23 working draft), [expr.add] paragraph 4, the
expression
_begin + n (i.e., P + J)
is defined only if
0 ≤ 0 + n ≤ _capacity_end - _begin (i.e., 0 ≤ i + j ≤ n)
equivalently, only if
_begin ≤ _begin + n ≤ _capacity_end
Therefore, the expression
_begin + n
invokes undefined behavior exactly when we'd expect our check
_begin + n > _capacity_end
to evaluate to true.
gcc and clang have been aggressively equating undefined behavior to "never
happens"; let's prevent that here.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Closesscylladb/scylladb#21213
It's somewhat common to ask for the partition key and clustering key
columns, or for the static and regular columsn. Provide accessors for them
rather than requiring the user to glue them.
Some callers are converted.
Closesscylladb/scylladb#21191
Add --max-failures configuration option to specify the amount,
if not set, or not positive, it will never trigger.
Update also the junit reporting to include skipped tests
instead of repeating it in cql-extensions.md, let's reference
the object storage related settings in admin.rst
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Enhance the documentation for nodetool commands that use the `--endpoint`
option by linking to the object storage configuration guide. This change
provides users with essential context and detailed setup instructions
for S3-compatible storage endpoints.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this commit moves the object storage configuration guide from the developer
documentation to the user-facing admin documentation. the change reflects
the increasing importance of object storage integration in user-facing
features.
in this change:
- move relevant content from `docs/dev/object_storage.md` to
`docs/operating-scylla/admin.rst`
- reformat the content from Markdown to reStructuredText (RST)
- reword and restructure the content to be more user-friendly
- add explanations and context suitable for a broader audience
this change makes the object storage configuration information more
accessible to Scylla administrators and end-users, supporting the adoption
of new features built on top of object storage integration.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* Do not leave passed tests in between failed ones.
* Use ANSI Escape sequences for manipulating console
* Simplifies code and removes need for two object parameters
Closesscylladb/scylladb#21176
For a table with NullCompactionStrategy and
TimeWindowCompactionStrategy, the test
- inserts a bunch of data and flushes the table
- deletes/update some data, delete a range of data and flushes
the table
- Triggers a major compaction and calls for compactionhistory
to retrieve and validate the histogram
Incorporate rows merged statistics into the output of the compactionhistory
command. Depending on the requested format type, the output has
different form.
For instance, compacting two sstables of a table consisting of 7
rows where two rows are part of the both sstables, the output
would have the following format:
text: {1: 5, 2: 2}
json: [{"key":1,"value":5},{"key":2,"value":1}]}
yaml: - key: 1
value: 5
- key: 2
value: 1
A new field has been added to the compaction_stats structure to hold
collected combined reader statistics. The struct is than used to update
the compaction_history table.
Compaction classes start mutate their internal members to be used
in methods setup_sstable_reader and make_sstable_reader creating
sstable reades that are marked as const.
Remove the const qualifier from these methods. Even though it made
sense initially to mark them as const, it is no longer applicable.
The pointer to combined_reader_statistics is propagated down to
make_combined_reader in order to collect statistics. By default,
a null pointer is propagated.
Note that in case the pointer is valid and the sstable_set consists
of exactly one sstable, statistics are skipped as all rows originate
from exactly a single sstable file.
The existing optimization is crucial f75154afca
All the overloaded make_combined_reader functions accept an optional
pointer to combined_reader_statistics, to be propagated down through
merging_reader to mutation_fragment_merger. By default, a null pointer
is propagated.
The maximum reader count allows to predict the number of readers
that can be created with create_new_readers(). This helps to
correctly allocate a vector size in the rows_merged statistics
when a combiner reader is created via make_combined_reader.
The mutation_fragment_merger takes one additional parameter in its
constructor, that is a pointer to a combined_reader_statistics
used to collect various statistics.
The histogram is populated with data while the merger consumes
batches from the producer and merges them into seperate mutation
fragments. The size of the batch, that represents the number of streams
the mutation fragment originates from, is used as a key in the
historgam and its corresponding value is increased by one.
get_description.py script is a document related script that looks for metrics description in the code.
Its configuration needs to address changes in the code.
This series contains a configuration change and a code fix that allows it to run as a standalone script, and not as a library.
No need to backport, this a documentation related script.
Closesscylladb/scylladb#19950
* github.com:scylladb/scylladb:
scripts/get_description.py: param_mapping was missing
scripts/metrics-config.yml: no need to get metrics from the tests
in 787ea4b1d4, we added "sstables" argument to the "nodetool restore"
command. but we failed to update the document to reflect the change.
in this change, we update the document for "restore" command to reflect
the latest implementation changes introduced in commit 787ea4b1d4:
* Add information about the new "sstables" argument
* Update command line usage of "--table" argument -- it is now madatory
* Update the example accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21135
before this change, we build some tests as if they are Seastar tests.
but after 415c83fa, these tests failed to link. because the
Seastar::seastar_testing does not expose `-DSEASTAR_TESTING_MAIN` in
its cflags. the behavior of the Seastar::seastar_testing is expected.
because a test linking against this library is not necessarily driven
by the `main()` provided by `testing/seastar_test.hh`.
so, in this change, we correct the `KIND` parameter of these tests,
so that they use `KIND BOOST`, as these tests can be driven by the
`main()` provided by Boost.Test's driver. also there are some tests
driven by Boost.Test's `main()`, but in the meanwhile, they utilize
seastar_testing, so let's add `Seastar::seastar_testing` to their
`LIBRARIES`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21183
the log.hh under the root of the tree was created keep the backward
compatibility when seastar was extracted into a separate library.
so log.hh should belong to `utils` directory, as it is based solely
on seastar, and can be used all subsystems.
in this change, we move log.hh into utils/log.hh to that it is more
modularized. and this also improves the readability, when one see
`#include "utils/log.hh"`, it is obvious that this source file
needs the logging system, instead of its own log facility -- please
note, we do have two other `log.hh` in the tree.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in check_headers.cmake, we verify the self containness of a header file
by replicating it and remove `#pragma once` directive in this header.
but this approach failed to compile headers which include a header file
with the same name in the root source directory, as we add
`-I<directory-of-original-header>` in the cflags when building the
generated source file, so that it can include the headers in the same
directory. but this confuses the compiler, as, assuming we have "log.hh"
in current directory, and under the root source directory, the compiler
would always include the "log.hh" in the current directory even it
should have included "log.hh" under the root source directory.
in this change, instead of adding `-I<directory-of-original-header>` to
cflags, we just include the header under test in a new .cc file
solely generated for testing. this should address this problem.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21216
since chrono allows dividion between durations with different units. let
use it instead for rounding down to the nearest multiple of the window
size, for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20476
This helper is small enough and it's easier to understand how table
directory name is formatted without it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now this helper is not needed in replica code, as all manipulations of
tables' sstables now sit in the sstables/storage.cc.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's write-only now, all the places than wanted to know where table's
storage is (well -- "are", there can be several directories) already use
storage_options.
This finishes the work started by 9fe64b5d70.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
As mentioned in the previous patch, there are several places that need
to scan all datafile directories for a given table. This list is
currently stored on table.config.all_datadirs, this patch stops using
one and instead generates it from db::config::data_file_directories and
table's storage options.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the time table with local storage keeps its sstables in a single
directory referenced by its storage_options::local.dir path. However,
there are two cases when code needs to check all datafile directories
that could be configured -- on boot when distributed loader loads
sstables, and when checking table snapshots.
Both those places check table.cfg.all_datadirs vector of strings and
convert strings to fs::path-s along the way. This patch prepares the
vector of fs::path-s in advance and updates the loop code to work with
path-s.
This is preparation to next patching that will generate vector of paths
for a table.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is continuation of 24589cf00c and a734fd5c9c -- if table is not
based on local storage, getting snapshot details makes no sense.
Another goal this change pursuits is to have storage_options::local
object at hand to be used later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit add a test for checking whether a large view update workload
can cause Scylla to run out of memory.
In the test, we keep writing to a table table with a materialized view
with a limited number of rows, causing overwrites which require reading
from the table to perform view updates.
Currently, due to the unlimited concurrency of view update reads, we may
use too much memory which can lead to bad_allocs, causing Scylla to fail.
To reach the failing state more consistently, we use add a sleep after
reading the old value of the base row, to keep the reader concurrency
semaphore units longer. At the same time, we use high concurrency and
large row size to use up all Scylla's memory quickly.
The test fails if Scylla runs out of memory and aborts, and succeeds
otherwise.
This commit add a test for checking whether a large view update workload
impacts the latency of other user reads.
In the test, we first create a table for reads and another table with
a materialized view. We then start writing to the table with the view
with a limited number of rows - when overwriting, we need to read the
previous value of the row to prepare a delete of the old row in the view.
This should not impact the latency of the read workload from the other
table that we start at the same time. The test fails if any of the reads
times out.
To reach the failing state more consistantly, we use add a sleep after
reading the old value of the base row, to keep the reader concurrency
semaphore units longer. At the same time, we use a lower threshold for
queueing reads on the semaphore, to see the impact of view update reads
earlier.
Because of the high load, the writes may timeout, but that's expected
- we fail the test only if the user reads time out.
now that we are allowed to use C++23. we now have the luxury of using
`std::views::keys`.
in this change, we:
- replace `boost::adaptors::map_keys` with `std::views::keys`
- update affected code to work with `std::views::keys`
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21198
When writing to some tables with materialized views, we need to read from the base
table first to perform a delete of the old view row. When doing so, the memory used
for the read is tracked by the user read concurrency semaphore. When we have a large
number of such reads, we may use up all of the semaphore units, causing the following
reads to be queued. When we have some user reads coming at the same time, these reads
can have very high latency due to the write workload on the base table. We want to avoid
this, so that the write workload doesn't have a high impact on the latency of the
read workload.
This is fixed in this patch by adding a separate read concurrency semaphore just for
view update read-before-writes. With the new semaphore, even if there are many view
update read-before-writes, they will be queued on a different semaphore than the user
reads, and they won't impact their latency.
The second issue fixed by this patch is the concurrency of the view updates that is
currently unlimited. Because of that view updates may take up so much memory that
they we may run out of memory.
This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier,
but they would take much more memory while staying in the queue.
The new semaphore has half the capacity of the regular user read concurrency semahpore
and is currently used only for user writes - is't used independently of the scheduling
group on which we base the read semaphore selection, but we use a different code path
for streaming (not database::do_apply) and we shouldn't have view updates in system
writes or during compaction.
Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805
when compiling date.h, clang 20 complains:
```
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/build/rust -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/Debug/seastar/gen/include -isystem /usr/include/p11-kit-1 -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -std=c++23 -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEBUG_PROMISE -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DFMT_SHARED -DWITH_GZFILEOP -MD -MT lang/CMakeFiles/lang.dir/Debug/lua.cc.o -MF lang/CMakeFiles/lang.dir/Debug/lua.cc.o.d -o lang/CMakeFiles/lang.dir/Debug/lua.cc.o -c /home/kefu/dev/scylladb/lang/lua.cc
In file included from /home/kefu/dev/scylladb/lang/lua.cc:18:
/home/kefu/dev/scylladb/utils/date.h:836:34: error: identifier '_d' preceded by whitespace in a literal operator declaration is deprecated [-Werror,-Wdeprecated-literal-operator]
836 | CONSTCD11 date::day operator "" _d(unsigned long long d) NOEXCEPT;
| ~~~~~~~~~~~~^~
| operator""_d
```
because, in
[CWG2521](https://wg21.link/CWG2521), it proposes that compiler should
consider
```c++
string operator "" _i18n(const char*, std::size_t); // OK, deprecated
```
as "OK, deprecated".
and Clang implemented this proposal, as it was accepted by C++23. since
scylladb uses C++23 standard. let's remove the space between `"` and
`_` to be more compliant to the C++23 standard and to silence the
warning, which is taken as an error.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21194
in 415c83fa, we introduced a regression which broke the build of
target of "package". because
- the IMPORT_LOCATION_<CONFIG> of the imported target of
"Seastar::iotune" includes a literal `$<CONFIG>`
- we retrieve the property named "IMPORTED_LOCATION" from
this target. but value of this property is empty.
so, when we copied this file, the "src" parameter passed to
`cmake -E copy` is actually an empty string.
in this change, we
- set the `IMPORTED_LOCATION_${CONFIG}` property with a
correct path.
- retrieve the property with the right approach -- to use
`TARGET_FILE` generator expression.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21181
std::ranges::to<>() has a little protocol with containers to
allow them to optimize their construction from ranges. Implement it
for small_vector. It optimizes ranges that can have their size determined
quickly, or that can be traversed twice to determine the size by reserving
up front. Single-pass ranges (std::ranges::input_range) use the less
efficient push_back method.
A unit test (which fails without the new constructor) is added.
Closesscylladb/scylladb#21094
This includes way too much, including <boost/regex.hpp>, which is huge.
Drop includes of adaptors.hpp and replace by what is needed.
Closesscylladb/scylladb#21187
now that we are able to use ranges library provided by the C++ standard library. there is no need to use the homebrew `ranges::to()`.
in this series, we
- switch to `std::ranges::to()` in favor of `ranges::to()`.
- and drop the unused `utils/ranges.hh` header file.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#21182
* github.com:scylladb/scylladb:
utils: remove unused ranges.hh
test/boost: stop using ranges::to()
when building the check-header target, we have following failure:
```
FAILED: CMakeFiles/check-headers-scylla-main.dir/Dev/check-headers/column_computation.hh.cc.o
/home/kefu/.local/bin/clang++ -DDEVEL -DSCYLLA_BUILD_MODE=dev -DSCYLLA_ENABLE_ERROR_INJECTION -DSCYLLA_ENABLE_PREEMPTION_SOURCE -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Dev\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/Dev/seastar/gen/include -isystem /usr/include/p11-kit-1 -isystem /home/kefu/dev/scylladb/abseil -O2 -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -Wno-unused-const-variable -Wno-unused-function -Wno-unused-variable -std=c++23 -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DFMT_SHARED -DWITH_GZFILEOP -MD -MT CMakeFiles/check-headers-scylla-main.dir/Dev/check-headers/column_computation.hh.cc.o -MF CMakeFiles/check-headers-scylla-main.dir/Dev/check-headers/column_computation.hh.cc.o.d -o CMakeFiles/check-headers-scylla-main.dir/Dev/check-headers/column_computation.hh.cc.o -c /home/kefu/dev/scylladb/build/check-headers/column_computation.hh.cc
/home/kefu/dev/scylladb/build/check-headers/column_computation.hh.cc:24:37: error: no template named 'unique_ptr' in namespace 'std'
24 | using column_computation_ptr = std::unique_ptr<column_computation>;
| ~~~~~^
/home/kefu/dev/scylladb/build/check-headers/column_computation.hh.cc:40:12: error: unknown type name 'column_computation_ptr'; did you mean 'column_computation'?
40 | static column_computation_ptr deserialize(bytes_view raw);
| ^~~~~~~~~~~~~~~~~~~~~~
| column_computation
```
it turns out we failed to include `<memory>`.
in this change, we include `<memory>` so that this header is
self-contained.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21185
When there are zero tablets, tablet_metadata::_balancing_enabled
is ignored in the copy.
The property not being preserved can result in balancer not
respecting user's wish to disable balancing when a replica is
created later on.
Fixes#21175.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#21177
now that we are able to use ranges library provided by the C++
standard library. there is no need to use the homebrew `ranges::to()`.
in this change, we switch to `std::ranges::to()` in favor of
`ranges::to()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in 787ea4b1, we construct a new `storage_options` for each sstable
to be restored. the `location` of the new `storage_option` instances
is composed of the configured `prefix` and the dirname of each toc
component. but instead of separating them with "/", we just concatenate
them. this breaks the test if the specified key representing toc
components includes "dirname" in them.
in this change
- data_directory: instead of using "{prefix}{dirname}", we use
"{prefix}/{dirname}".
- test/object_store: update the existing test to add a suffix
in the keys of the toc objects to mimic the typical use case.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21170
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.
Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.
Fixes#20916
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
This will be used in a following patch to decide if the compacting
reader has to check the memtables before purging a tombstone.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Passing an admitted permit -- i.e. one with count resources on it -- to the multishard reader, will possibly result in a deadlock, because the permit of the multishard reader is destroyed after the permits of its child readers. Therefore its semaphore resources won't be automatically released until children acquire their own resources. This creates a dependency (an edge in the "resource allocation graph"), where the semaphore used by the multishard reader depends on the semaphores used by children. When such dependencies create a cycle, and permits are acquired by different reads in just the right order, a deadlock will happen.
Users of the multishard reader have to be aware of this gotcha -- and of course they aren't. This is small wonder, considering that not even the documentation on the multishard reader mentions this problem. To work around this, the user has to call `reader_permit::release_base_resources()` on the permit, before passing it to the multishard reader. On multiple occasions, developers (including the very author of the multishard reader), forgot or didn't know about this and this resulted in deadlocks down the line. This is a design-flaw of the multishard reader, which is addressed in this PR, after which, it is safe to pass admitted or not admitted permits to the multishard reader, it will handle the call to `release_base_resources()` if needed.
After fixing the problem in the multishard reader, the existing calls to `release_base_resources()` on permits passed to multishard readers are removed. A test is added which reproduces the problem and ensures we don't regress.
Refs: https://github.com/scylladb/scylladb/issues/20885 (partial fix, there is another deadlock in that issue, which this PR doesn't fix)
This fixes (indirectly) a regression introduced by d98708013c so it has to be backported to 6.2
Closesscylladb/scylladb#21058
* github.com:scylladb/scylladb:
test/boost/mutation_test: add test for multishard permit safety
test/lib/reader_lifecycle_policy: add semaphore factory to constructor
test/lib/reader_lifecycle_policy: rename factory_function
repair/row_level: drop now unneeded release_base_resource() calls
readers/multishard: make multishard reader safe to create with admitted permits
Fixes#18903
Adds a "transitional" internode encryption mode, under which all _outgoing_ RPC connections will use TLS, but we will still accept any incoming non-tls connection.
This allows an operator to perform a move to TLS RPC without cluster downtime:
1. For each server, add certificate etc options to server_encryption_options + internode_encryption=none + set ssl_storage_port + restart (rolling)
2. For each server, set internode_encryption=transitional + RR
3. For each server, set internode_encryption=all + RR
Closesscylladb/scylladb#18939
* github.com:scylladb/scylladb:
test::topology: Add test for TLS upgrade and downgrade of internode encryption
docs: Add internode_encryption=transitional documentation
messaging_service: Add "transitional" internode encryptipn mode
messaging_service: Create TLS connector even if internode_enc=none when certs set
before this change, scylla's CMake-based system consumes Seastar
library by including it directly. but this failed to address the needs
of linking against Seastar shared libraries in Debug and Dev builds, while
linking against the static libraries in other builds. because Seastar
uses `BUILD_SHARED_LIBS` CMake variable to determine if it builds
shared libraries. and we cannot assign different values to this
CMake variable based on current configure type -- CMake does not
support. see https://gitlab.kitware.com/cmake/cmake/-/issues/19467
in order to address this problem, we have a couple possible
solutions:
- to enable Seastar to build both shared and static libraries in a
pass. without sacrificing the performance, we have to build
all object files twice: once with -fPIC, once without. in order
to accompolish this goal, we need to develop a machinary to
populate the same settings to these two builds. this would
complicate the design of Seastar's building system further.
- to build Seastar libraries twice in scylla, we could use
the ExternalProject module to implement this. but it'd be
complicate to extract the compile options, and link options
previously populated by Seastar's targets with CMake --
we would have to replicate all of them in scylla. this is
out of the question.
- to build Seastar libraries twice before building scylla,
and let scylla to consume them using CMake config files or
.pc files. this is a compromise. it enables scylla to
drive the build of Seastar libraries and to consume
the compile options and link options. the downside is:
* the generated compilation database (compile_commands.json)
does not include the commands building Seastar anymore.
* the building system of scylla does not have finer graind
control on the building process of seastar. for instance,
we cannot specify the build dependency to a certain seastar
library, and just build it instead of building the whole
seastar project.
turns out the last approach is the best one we can have
at this moment. this is also the approach used by the existing
`configure.py`.
in this change, we
- add FindSeastar.cmake to
* detect the preconfigured Seastar builds, and
* extract the build options from .pc files
* expose library targets to be consumed by parent project
- add Seastar as an external project, so we can build it from
the parent project.
this is atypical compared to standard ExternalProject usage:
- Seastar's build system should already be configured at this point.
- We maintain separate project variants for each configuration type.
Benefits of this approach:
- Allows the parent project to consume the compile options exposed by
.pc file. as the compile options vary from one config to another.
- Allows application of config-specific settings
- Enables building Seastar within the parent project's build system
- Facilitates linking of artifacts with the external project target,
establishing proper dependencies between them
we will update `configure.py` to merge the compilation database
of scylla and seastar.
Refs scylladb/scylladb#2717
---
this is a CMake-related change, hence no need to backport.
Closesscylladb/scylladb#21131
* github.com:scylladb/scylladb:
build: cmake: use GENERATOR_IS_MULTI_CONFIG property to detect mult-config
build: cmake: consume Seastar using its .pc files
build: do not use `mode` as the index into `modes`
build: cmake: detect and link against GnuTLS library
build: cmake: detect and link against yaml-cpp
build: cmake: link Seastar with Seastar::<COMPONENT>
build: cmake: define CMake generate helper funcs in scylla
get_description.py was moved from a standalone script to a library.
During the transition, param_mapping was not included in the script
option.
This patch makes it possible to use the file as a standalone script
again.
Seastar now respect CMAKE_CXX_STANDARD in favor of Seastar_CXX_DIALECT,
which has been dropped in Seastar's commit of
60bc8603bd438232614e9b3dcd7537dc83c85206 .
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21130
Endpoints are registered next to the service they use, and the unregistration deferred action is created right after it. When registered, the service in question is passed as argument and then captured by enpoints lambdas. This makes sure that service is not used by endpoints after being stopped.
That's not so for commitlog endpoints. These are registered in several places, and /commitlog "function" is not unregistered on stop. This patch fixes some of this misbehavior, in particular:
- adds unregistration of commitlog API function
- uses sharded<database>& argument in endpoints instead of ctx.db
- moves some endpoints from storage_service.cc to commitlog.cc
Closesscylladb/scylladb#21053
* github.com:scylladb/scylladb:
api: Use captured database, not the one from ctx
api: Pass sharded<database> to commitlog endpoints registration
api: Move commitlog-related from storage_service.cc
api: Unset commitlog API endpoints
api: Extract set_server_commitlog() from set_server_done()
To reduce dependency load, change uses of boost ranges to std::ranges.
The first patch is preparation, replacing a construct that isn't easy to support with std ranges with something simpler.
No backport as this is a code cleanup.
Closesscylladb/scylladb#21122
* github.com:scylladb/scylladb:
schema: replace boost ranges with std ranges
schema: precompute all_columns_in_select_order()
The statistics_rewrite test case copies an sstable from resources two
times:
- first time -- explicitly by listing resource components and copying
files to the test temp dir
- second time -- implicitly, by calling create_links() linking copied
files by new set in the staging/ subdirectory
The 2nd step is not needed and the history of changes justifies that.
The test itself appeared with 70b793e4d3 and it only contained the 2nd
"copying" -- test linked files from resource directory and then worked
in the newly created set.
Later, commit 59c57861ae added the first step and copied the files
from resource into test temp dir. At this point linking copied files
because pointless, but was preserved. Let's remove it now.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21097
when building the tree with Clang-20 and libstdc++ shippped with
GCC-14.2, we have following build failure:
```
/home/kefu/dev/scylladb/interval.hh:638:14: error: no member named 'sort' in namespace 'std'
638 | std::sort(intervals.begin(), intervals.end(), [&](auto&& r1, auto&& r2) {
| ~~~~~^
/home/kefu/dev/scylladb/interval.hh:691:21: error: no member named 'upper_bound' in namespace 'std'
691 | return std::upper_bound(r.begin(), r.end(), value, std::forward<LessComparator>(cmp));
| ~~~~~^
/home/kefu/dev/scylladb/interval.hh:723:18: error: no member named 'minmax' in namespace 'std'; did you mean 'fminmag'?
723 | auto p = std::minmax(_interval, other._interval, [&cmp] (auto&& a, auto&& b) {
| ^~~~~~~~~~~
| fminmag
```
it turns out we failed to include the used header.
in this change, we include `<algorithm>` so that this header is
self-contained.
after this change, the build passes.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21168
before this change, scylla's CMake-based system consumes Seastar
library by including it directly. but this failed to address the needs
of linking against Seastar shared libraries in Debug and Dev builds, while
linking against the static libraries in other builds. because Seastar
uses `BUILD_SHARED_LIBS` CMake variable to determine if it builds
shared libraries. and we cannot assign different values to this
CMake variable based on current configure type -- CMake does not
support. see https://gitlab.kitware.com/cmake/cmake/-/issues/19467
in order to address this problem, we have a couple possible
solutions:
- to enable Seastar to build both shared and static libraries in a
pass. without sacrificing the performance, we have to build
all object files twice: once with -fPIC, once without. in order
to accompolish this goal, we need to develop a machinary to
populate the same settings to these two builds. this would
complicate the design of Seastar's building system further.
- to build Seastar libraries twice in scylla, we could use
the ExternalProject module to implement this. but it'd be
complicate to extract the compile options, and link options
previously populated by Seastar's targets with CMake --
we would have to replicate all of them in scylla. this is
out of the question.
- to build Seastar libraries twice before building scylla,
and let scylla to consume them using CMake config files or
.pc files. this is a compromise. it enables scylla to
drive the build of Seastar libraries and to consume
the compile options and link options. the downside is:
* the generated compilation database (compile_commands.json)
does not include the commands building Seastar anymore.
* the building system of scylla does not have finer graind
control on the building process of seastar. for instance,
we cannot specify the build dependency to a certain seastar
library, and just build it instead of building the whole
seastar project.
turns out the last approach is the best one we can have
at this moment. this is also the approach used by the existing
`configure.py`.
in this change, we
- add FindSeastar.cmake to
* detect the preconfigured Seastar builds, and
* extract the build options from .pc files
* expose library targets to be consumed by parent project
- add Seastar as an external project, so we can build it from
the parent project. BUILD_AWAYS is set to ensure that Seastar is
rebuilt, as scylla developers are expected to modify Seastar
occasionally. since the change in Seastar's SOURCE_DIR is not
detectable via the ExternalProject, we have to rebuild it.
this is atypical compared to standard ExternalProject usage:
- Seastar's build system should already be configured at this point.
- We maintain separate project variants for each configuration type.
Benefits of this approach:
- Allows the parent project to consume the compile options exposed by
.pc file. as the compile options vary from one config to another.
- Allows application of config-specific settings
- Enables building Seastar within the parent project's build system
- Facilitates linking of artifacts with the external project target,
establishing proper dependencies between them
- preserve the existing machinery of including Seastar only when
building without multi-config generator. this allows users who don't
use mult-config generator to build Seastar in-the-tree. the typical
use case is the CI workflows performing the static analysis.
we will update `configure.py` to merge the compilation database
of scylla and seastar.
Refs scylladb/scylladb#2717
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, in `configure_seastar()`, we use `mode` as
a component in the build directory, and use it as the index into `modes`
dict. but in a succeeding commit, we will reuse `configure_seastar()`
when preparing for the CMake-based building system, in which,
`mode` will be the CMake configure type, like "Debug" instead of
scylla's build mode, like "debug". to be prepared for this change,
let's use `mode_config` directly. it's identical to `modes[mode]`.
this also improves the readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in main.cc, we use yaml-cpp library directly. so we are obliged to
detect this library in scylla and link against it instead of relying
on other library to do this. currently, Seastar detects it and pulls
in yaml-cpp for us, but we should not take this for granted and rely
on this.
in this change, we detect and link against yaml-cpp to make this
dependency explicit.
the same applies to the "utils" library.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we link against the targets defined in Seastar's
source tree. but these targets are not part of Seastar's public
interface -- they are not exposed by Seastar's CMake config files.
so, let link against the target names qualified by the library module
name. this also prepares for the transition to using Seastar without
including it directly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we assume that scylla's CMake script includes
Seastar's CMake script.
but we are going to consume Seastar using its .pc files or its CMake
config files instead of including it directly. more over these helper
functions are not part of Seastar's public interface.
actually the same applies to the `check_headers()` helper, which was
adapted from seastar's CheckHeaders.cmake.
so to be prepared for this change, let's define these generate helper
functions in scylla.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Allowing callers to specify how the semaphore is created and stopped,
instead of doing so via boolean flags like it is done currently. This
method doesn't scale, so use a factory instead.
To reader_factor_function. We are about to add a new factory function
parameters, so the current factory_function has to be renamed to
something more specific.
Passing an admitted permit -- i.e. one with count resources on it -- to
the multishard reader, will possibly result in a deadlock, because the
permit of the multishard reader is destroyed after the permits of its
child readers. Therefore its semaphore resources won't be automatically
released until children acquire their own resources.
This creates a dependency (an edge in the "resource allocation graph"),
where the semaphore used by the multishard reader depends on the
semaphores used by children. When such dependencies create a cycle, and
permits are acquired by different reads in just the right order, a
deadlock will happen.
Users of the multishard reader have to be aware of this gotcha -- and of
course they aren't. This is small wonder, considering that not even the
documentation on the multishard reader mentions this problem.
To work around this, the user has to call
`reader_permit::release_base_resources()` on the permit, before passing
it to the multishard reader.
On multiple occasions, developers (including the very author of the
multishard reader), forgot or didn't know about this and this resulted
in deadlocks down the line.
This is a design-flaw of the multishard reader, which is addressed in
this patch, after which, it is safe to pass admitted or not admitted
permits to the multishard reader, it will handle the call to
`release_base_resources()` if needed.
Skip the advertisement of the group0 state id in case the gossiper is
not active (ready).
Sending the application state when the gossiper is not active caused
a warning being shown in the log about the local endpoint not being
found in the gossiper endpoint state map on a (graceful) node restart.
The local endpoint is initialized on the gossiper startup, so we skip
the state id advertisement until the startup is finished.
Fixes: scylladb/scylladb#21117
No backport: Fixes an issue that is currently only present in master
Closesscylladb/scylladb#21119
* github.com:scylladb/scylladb:
raft: consider the gossiper state then sending the group0 state id
raft: add the test for GROUP0_STATE_ID gossip application state
this change is created in the same spirit of 0104c7d3, which used
`std::map::contains()` in the place of `std::map::count()` when
checking for the existence of a paramter with given name for better
readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21158
implement cassandra original schema option memtable_flush_period_in_ms:
Milliseconds before memtables associated with the table are flushed.
there are few things concerning this patch:
* milliseconds look strange and scary for this option. Unlike Cassandra
we use 60000ms (1min) minimum value for this option.
* This is limitation of Cassandra but it is impossible to set this option
for system tables. However sometimes it could be very useful to use
automatic flushing for such a tables: some system tables have small
traffic and as a result prevent tombstone garbage collection.
Fixes#20270Closesscylladb/scylladb#20999
This commit removes the raw:: html directive (with the exception of an embedded animation) because:
- It is not supported by the dark theme and looks bad.
- It's a legacy directive, and we no longer need it on index pages.
Fixes https://github.com/scylladb/scylladb/issues/20881Closesscylladb/scylladb#21062
before this change, if no positional arguments are passed to "restore"
subcommand, the tool fails with following error message:
```
error running operation: boost::wrapexcept<boost::bad_any_cast> (boost::bad_any_cast: failed conversion using boost::any_cast)
```
this is difficult to digest.
after this change, if no sstables are specified:
```
error processing arguments: missing required parameter: sstables
```
this is slightly better from user experience's perspective.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21136
`typeof` is a GNU extension, and is part of C23, but it is not included
by C++23.
if we compile the tree with c++23 instead of gnu++23, the compilation
fails like:
```
FAILED: repair/CMakeFiles/repair.dir/RelWithDebInfo/repair.cc.o
/home/kefu/.local/bin/clang++ -DSCYLLA_BUILD_MODE=release -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/RelWithDebInfo/seastar/gen/include -isystem /usr/include/p11-kit-1 -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -mllvm -inline-threshold=2500 -fno-slp-vectorize -std=c++23 -Werror=unused-result -DSEASTAR_API_LEVEL=7 -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_LOGGER_TYPE_STDOUT -DFMT_SHARED -DWITH_GZFILEOP -MD -MT repair/CMakeFiles/repair.dir/RelWithDebInfo/repair.cc.o -MF repair/CMakeFiles/repair.dir/RelWithDebInfo/repair.cc.o.d -o repair/CMakeFiles/repair.dir/RelWithDebInfo/repair.cc.o -c /home/kefu/dev/scylladb/repair/repair.cc
In file included from /home/kefu/dev/scylladb/repair/repair.cc:21:
In file included from /home/kefu/dev/scylladb/service/storage_service.hh:19:
In file included from /home/kefu/dev/scylladb/service/qos/service_level_controller.hh:19:
In file included from /home/kefu/dev/scylladb/auth/service.hh:23:
In file included from /home/kefu/dev/scylladb/auth/permissions_cache.hh:22:
/home/kefu/dev/scylladb/utils/loading_cache.hh:754:66: error: use of undeclared identifier 'typeof'; did you mean 'typeid'?
754 | static_assert(SectionHitThreshold <= std::numeric_limits<typeof(_touch_count)>::max() / 2, "SectionHitThreshold value is too big");
| ^
/home/kefu/dev/scylladb/utils/loading_cache.hh:754:66: error: template argument for template type parameter must be a type
754 | static_assert(SectionHitThreshold <= std::numeric_limits<typeof(_touch_count)>::max() / 2, "SectionHitThreshold value is too big");
| ^~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/limits:311:21: note: template parameter is declared here
311 | template<typename _Tp>
| ^
2 errors generated.
```
in this change, we trade `typeof` for a more standard compliant
`decltype`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21116
To reduce dependency load, replace use of boost ranges with std ranges.
Since std ranges are more particular about what iterators they accept, a custom iterator
in size_estimates_virtual_reader has to be fixed first.
No backport; code cleanup.
Closesscylladb/scylladb#21143
* github.com:scylladb/scylladb:
interval: change boost ranges to std ranges
size_estimates_virtual_reader: make virtual_row_iterator more conforming
The stop assertion check in the group0 state id handler was triggering
under some circumstances (stopping server during restart). In that case
it might be that the stop is initiated before the server is fully
initialized, and then the handler destructor is being called without
calling to the `stop()` method first. This is a valid scenario.
The whole `stop()` in the group0 state id handler is not necessary,
as the only operation being done is cancelling the timer which is done
by the timer destructor automatically anyway.
There is the concern of a currently running timer callback, but it
doesn't preempt (not async) so the timer shouldn't be destroyed before
the callback finishes.
Fixes: scylladb/scylladb#21074Closesscylladb/scylladb#21127
Skip the advertisement of the group0 state id in case the gossiper is
not active (ready).
Sending the application state when the gossiper is not active caused
a warning being shown in the log about the local endpoint not being
found in the gossiper endpoint state map on a (graceful) node restart.
The local endpoint is initialized on the gossiper startup, so we skip
the state id advertisement until the startup is finished.
Fixes: scylladb/scylladb#21117
Test that the GROUP0_STATE_ID gossip application state is not causing
the "endpoint_state_map does not contain endpoint" error.
Refs: scylladb/scylladb#21117
Fixes#21150
Apparently, on some CI, in debug, these tests can time out (large alloc)
without actually failing what they do. Up the timeout (could consider removing
as well, but...) so they hopefully pass.
Closesscylladb/scylladb#21151
- `docs/Makefile`: work around python-poetry issue https://github.com/python-poetry/poetry/issues/8761
- `docs/README.md`: fix minimum poetry version
No backporting needed (docs development).
Closes scylladb/scylladb#21118
* github.com:scylladb/scylladb:
docs/README.md: fix minimum poetry version
docs/Makefile: work around python-poetry issue #8761
This patch adds test/cql-pytest tests which verify that all CQL operations
that shouldn't be allowed on a materialized view, actually aren't:
* All operations writing to a table - INSERT, UPDATE, BATCH, DELETE,
and TRUNCATE - should be rejected when asked to operate on a view.
* All operations with "TABLE" in their name (DROP TABLE, ALTER TABLE,
DESC TABLE) should be rejected on a view - the ".. MATERIALIZED VIEW"
operation should be used instead.
* A materialized view cannot get materialized views or indexes of its
own.
All tests pass on Cassandra (Cassandra 4 or above is needed for the
"DESC" test), and all but one pass on Scylla - Scylla does allow
"DESC TABLE" on a materialized view, unlike Cassandra. I opened an
issue to track that difference: Refs #21026
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21028
To work with std::ranges, an iterator has to have a default constructor,
and be assignable. Add the default constructor and convert references
to pointers to support this.
This method spawns part uploading in the background, but still may
throw, e.g. preparing http request or claiming memory. In this case any
outstanding part upload fibers are not waited on, and the whole
do_upload_file object can be freed from under their feet. Also, the
multipart upload is not aborted, thus losing track of it until g.c.
happens.
To fix it, catch any exception from upload_part() too, and if it
happens, do what the regular upload_sink would do -- close the gate thus
picking up any outstanding activity that may happen there and abort the
multipart upload.
Indentation is deliberately left broken
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In this patch we add to docs/new-apis.md (Alternator-specific API)
a description of the service discovery HTTP requests - `/` and
`/localnodes` that was previously not documented except in a design
document that is unfortunately no longer available publically.
The description also includes the recently added `dc` and `rack`
parameters for the `/localnodes` request.
Fixes#20989
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, the documentation of Alternator-specific APIs (APIs
which are unique to Alternator and don't exist in DynamoDB) appear as
a section of the main document alternator.md. In the next patch we
want to describe yet another Alternator feature and make this section
even longer. But there is growing sentiment that the Alternator
documentation should be split into more, shorter, pages (Refs #19822)
so this patch splits the Alternator-specific API documentation into a
new file, new-apis.md.
There is no new content in the patch - just movement of existing content
plus a reference to the new page.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Standardize on a single range library.
The changes are mostly mechanical. The only exception is boost::join,
which has no analog in std::ranges (rightly so, since it cannot be
implemented efficiently). A variety of tricks were used to convert it:
- use std::ranges::join() on an std::array of std::span (when the
inputs were all contiguous)
- copy to a utils::small_vector (when it is expected that there will
be no allocation)
- use a small_vector of pointers and iterate+dereference that
Closesscylladb/scylladb#21082
To reduce dependency load, use std ranges instead of boost ranges.
The std::ranges::{lower,upper}_bound don't support heterogeneous lookup,
but a more natural solution is to use a projection to search for the name,
so we use that and the custom comparator is removed.
Many callers are converted as well due to poor interoperability between
boost ranges and std ranges.
The test_view_build_status_migration_to_v2 test case creates a new view
(vt2) after peforming the view_build_status -> view_build_status_v2
migration and waits until it is built by `wait_for_view_v2` function. It
works by waiting until a SELECT from view_build_status_v2 will return
the expected number of rows for a given view.
However, if the host parameter is unspecified, it will query only one
node on each attempt. Because `view_build_status_v2` is managed via
raft, queries always return data from the queried node only. It might
happen that `wait_for_view_v2` fetches expected results from one node
while a different node might be lagging behind the group0 coordinator
and might not have all data yet.
In case of test_view_build_status_migration_to_v2 this is a problem - it
first uses `wait_for_view_v2` to wait for view, later it queries
`view_build_status_v2` on a random node and asserts its state - and
might fail because that node didn't have the newest state yet.
Fix the issue by issuing `wait_for_view_v2` in parallel for all nodes in
the cluster and waiting until all nodes have the most recent state.
Fixes: scylladb/scylladb#21060Closesscylladb/scylladb#21091
This change reorganizes the way standard_role_manager startup is handled: role_manager::ensure_superuser_is_created() is added, which returns a future that resolves once the superuser is available. We wait for this future before starting the CQL server.
There is a change in behavior auth::do_after_system_ready is potentially an infinite loop, and we await its result.
Fixes#10481
Reason for no backports: it's not a regresson and it's an issue that may only affect a tiny time window during the cluster startup.
Closesscylladb/scylladb#20137
* github.com:scylladb/scylladb:
test: test_restart_cluster: create the test
auth: standard_role_manager allows awaiting superuser creation
auth: coroutinize the standard_role_manager start() function
auth: don't start server until the superuser is created
all_columns_in_select_order() returns a complicated boost range type
that has no analog in std::ranges. To ease the transition to std::ranges,
precompute most of the work done in that function, and only convert
pointers to references in the function itself.
Since boost ranges and std::ranges don't fully interoperate, one of
the user has to be adjusted.
Commit 2a3012db7f ("docs/README.md: expand prerequisites list",
2022-08-31) referenced poetry release 1.12, which does not exist even
today (as of this writing, the latest release is 1.8.4). The intent was
probably 1.1.12.
Copy the minimum version from "sphinx-scylladb-theme": 1.8.1 (see
"docs/source/getting-started/installation.rst" and
"docs/source/getting-started/quickstart.rst" at commit f7c26b422572).
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Python-poetry is affected by bug
<https://github.com/python-poetry/poetry/issues/8761>. Namely, if you have
"keyring" <https://pypi.org/project/keyring/> installed, poetry will try
to gain access to the Default collection in the (ex. GNOME) keyring, even
if poetry only needs read-only access to package repositories, and even if
those repos are public.
Consequently, you either unlock your Default collection for poetry
(unjustifiedly), or your GUI session gets effectively locked up, because
any time you hit Cancel on the keyring unlock dialog, poetry immediately
pops up another, and this dialog grabs the keyboard -- you cannot even
switch to a character VT, for killing poetry; you have to log in via ssh
for that.
This issue is not visible to users who don't use "keyring" (GNOME or
otherwise). For those who do, work around the problem by selecting the
"null" keyring back-end, in the environment of every poetry invocation.
Note: I have not regression-tested the workaround in a desktop environment
where "keyring" is unavailable to begin with.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Endpoints are registered next to the service they use, and the unregistration deferred action is created right after it. When registered, the service in question is passed as argument and then captured by enpoints lambdas. This makes sure that service is not used by endpoints after being stopped.
That's not quite the case for compaction manager. Its endpoints can be registered in several places, and compaction_manager "function" is not unregistered on stop. This patch fixes some of this misbehavior, in particular:
- adds unregistration of compaction_manager API function
- uses sharded<compaction_manager>& argument in endpoints instead of ctx.db.local().get_compaction_manager() chain
- moves some endpoints from storage_service.cc to compaction_manager.cc
Closesscylladb/scylladb#20962
* github.com:scylladb/scylladb:
api: Use captured compaction_manager in get_cm_stats() helper
api: Use captured compaction_manager in endpoints
api: Add sharded<compaction_manager> argument to compaction_manager API reg/unreg
api: Move some endpoints from storage_service.cc to compaction_manager.cc
api: Unset compaction_manager endpoints
api: Use shorter registration method for compaction_manager function
* seastar 3c9c2696...abd20efd (44):
> Revert "build: enable Seastar to build shared and static libs in a single build"
> dns: Support c-ares before 1.22
> build: improve c-ares version extraction method
> Minor typos fix in doc: reference_wrapper.hh
> build: enable Seastar to build shared and static libs in a single build
> build: include -fno-semantic-interposition in CXXFLAGS
> loop: add Sentinel iterator support to parallel_for_each()
> dns: use ARES_LIB_INIT_NONE instead of a magic number
> dns: use struct typedef for `_channel`
> doc/testing.md: explain seastar + boost test colocation
> build: extract c-ares version from header file
> dns: replace deprecated ares_process() with ares_process_fd()
> build: do not support c-ares >= 1.33
> http: fix indentation
> http: Add non-owning `make_request` to http client
> treewide: replace boost::irange with std::views::iota where possible
> Added unit test for "http_content_length_data_sink_impl"
> sharded.hh: migrate to concepts
> file, scheduling: remove non-unified I/O and CPU scheduling
> http: Add more HTTP response codes
> http: add constness to `response_line`
> http: refactor response_line to use `seastar::format`
> httpd/file_handler: Always close stream
> build: add compiler and C++ standard compatibility checks
> rpc: rpc_types: replace boost::any with std::any
> tls: drop dependency on boost::any
> rpc: drop unnecessaty includes to boost libraries
> rpc: compressor factory: deinline some boost-using functions
> sharded: replace boost ranges with <ranges>
> scheduling_specific: drop dependency on boost range adaptors
> prefetch: drop dependency on boost::mpl
> resource: drop unused dependency on boost::any
> smp: drop dependency on boost ranges
> reactor: remove unnecessary boost includes
> execution_stage: remove unnecessary boost includes
> sharded.hh: add invoke_on variant for a shard range
> shared_ptr: remove deprecated lw_shared_ptr assignment operator
> seastar-addr2line: add --debug arg
> addr2line: add type checking
> warnings: fix unused result warnings
> thread_pool: fix includes
> signal: remove trailing spaces
> tests/unit: chmod -x signal_test.cc
> iostream/http: Fix output_stream::write(temporary_buffer) overload
Closesscylladb/scylladb#21109
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.
fixes#20043Closesscylladb/scylladb#21020
* github.com:scylladb/scylladb:
tablets: Validate system.tablets update
group0_client: Introduce change validation
group0_client: Add shared_token_metadata dependency
The testcase is flaky due to a known python driver issue:
https://github.com/scylladb/python-driver/issues/317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.
Fixes: scylladb/scylladb#21034
Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.
Closesscylladb/scylladb#21056
aiohttp 3.10.5 complains when 'unix+http' is used for a unix-domain
socket. USe 'http', which work with 3.10.5 and the toolchain's 3.9.5.
Closesscylladb/scylladb#21080
The SCYLLA-VERSION-GEN file skips updating the SCYLLA-*-FILE files if
the commit hash from SCYLLA-RELEASE-FILE is the same. The original
reason for this was to prevent the date in the version string from
changing if multiple modes are built across midnight
(scylladb/scylla-pkg#826). However - intentionally or not - it serves
another purpose: it prevents an infinite loop in the build process.
If the build.ninja file needs to be rebuilt, the configure.py script
unconditionally calls ./SCYLLA-VERSION-GEN. On the other hand, if one
of the SCYLLA-*-FILE files is updated then this triggers rebuild
of build.ninja. Apparently, this is sufficient for ninja to enter an
infinite loop.
However, the check assumes that the RELEASE is in the format
<build identifier>.<date>.<commit hash>
and assumes that none of the components have a dot inside - otherwise it
breaks and just works incorrectly. Specifically, when building a private
version, it is recommended to set the build identifier to
`count.yourname`.
Previously, before 85219e9, this problem wasn't noticed most likely
because reconfigure process was broken and stopped overwriting
the build.ninja file after the first iteration.
Fix the problem by fixing the logic that extracts the commit hash -
instead of looking at the third dot-separated field counting from the
left side, look at the last field.
Fixes: scylladb/scylladb#21027Closesscylladb/scylladb#21049
Test a rolling upgrade of cluster while active.
Note: This is a unit test version of dtest test. Has the big drawback of not
being able to use cassandra-stress to work and verify the cluster and results
Test moves from none to all to none encryption while writing and then checking
written data.
Fixes#18903
Adds a "transitional" internode encryption mode, under which all
_outgoing_ RPC connections will use TLS, but we will still accept
any incoming non-tls connection.
This allows an operator to perform a move to TLS RPC without cluster
downtime:
1. For each server, add certificate etc options to
server_encryption_options
+ internode_encryption=none
+ set ssl_storage_port
+ restart (rolling)
2. For each server, set internode_encryption=transitional + RR
3. For each server, set internode_encryption=all + RR
Refs #18903
If ssl_storage_port is non-zero _and_ we have specified actual certificates
are set/exists, create TLS connector for RPC regardless of whether internode
encryption is enables. I.e. potentially unused. For transitioning cluster to
TLS.
The system.sstables (a.k.a. sstables registry) primary key is "string location" as partition key and "uuid generation" as clustering one. The "location" part was taken from table.config.datadir value which, in turn, a string containing path to on-disk files if the table was located locally, e.g. /var/lib/scylla/data/ks/cf-abc123 one. Recently [1] the datadir was moved from table config onto storage options, but this string is still used as registry key.
Other than being owned by a table with ID, sstables are accessed by restore-from-object-storage code [2]. To make it work, both storage driver and sstable_directory helper class maintain two formats of object prefixes for sstables components. For S3-backed sstables having a record in registry, the path used is s3://bucket/generation/component. For restore code there are user-provided prefixes that do not match the aforementioned pattern. The selection between those two is now made by checking sstable state, which is not obvious and may cause troubles for tiered storage driver.
This patch changes the registry schema so that partition key becomes "uuid owner" and is set to be table.id() value. This is to stop using the local path by S3 backed sstables. Also this change makes it possible for storage driver and sstable directory to rely on the storage options only to tell different bucket prefixes formats from each other.
As a side effect, the make_s3_object_name() helper, that generates the proper object name, becomes explicit for restore-from-S3 usage. Now it relies on the sstable::filename() calling this->prefix() behind the scenes and the latter to return the user-provided prefix, which is pretty fragile construction.
No need to backport (and it's not going to be easy to do it), storage options feature is still experimental
Refs #20675 [1]
Refs #20305 [2]
Closesscylladb/scylladb#20998
* github.com:scylladb/scylladb:
sstables: Flatten S3 object name making
sstable_directory: Flatten directory lister creation
treewide: Rename sstable registry location field to be owner
system_keyspace: Change sstables registry partition key type
sstables: Keep location variant on s3 backend too
storage_options: Use variant on S3 options
sstables: Split sstable::filename() helper
sstables: Add s3_storage::owner() helper
seastar extracted `addr2line` python module out back in
e078d7877273e4a6698071dc10902945f175e8bc. but `install.sh` was
not updated accordingly. it still installs `seastar-addr2line`
without installing its new dependency. this leaves us with a
broken `seastar-addr2line` in the relocatable tarball.
```console
$ /opt/scylladb/scripts/seastar-addr2line
Traceback (most recent call last):
File "/opt/scylladb/scripts/libexec/seastar-addr2line", line 26, in <module>
from addr2line import BacktraceResolver
ModuleNotFoundError: No module named 'addr2line'
```
in this change, we redistribute `addr2line.py` as well. this
should address the issue above.
Fixesscylladb/scylladb#21077
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21078
When we made the raft-based topology mandatory, all boost test
tests started using it. Then, `test_read_required_hosts` started
failing. We left investigating it for later and started running it
with `force-gossip-topology-changes` to make it pass.
Currently, the test doesn't fail with the raft-based topology
anymore. Hence, we remove the FIXME and run the test with a normal
config.
We don't know when and why the test stopped failing. Investigating
it wouldn't be easy, since we don't even know why it failed in the
first place. We suspect that there was some bug that is now fixed.
This patch only fixes a test, there is no need to backport it.
Fixesscylladb/scylladb#18463Closesscylladb/scylladb#20960
During the investigation of scylladb/scylladb#20282, it was discovered that implementations of speculating read executors have undefined behavior when called with an incorrect number of read replicas. This PR introduces two levels of condition checking:
- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in filter_for_query(): the map is considered incorrect if the list of replicas contains a node from a data center whose replication factor is 0.
Please note: This PR does not fix the issue found in scylladb/scylladb#20282; it only adds condition checks to prevent undefined behavior in cases of inconsistent inputs.
Refs scylladb/scylladb#20625
As this issue applies to the releases versions and can affect clients, we need backports to 6.0, 6.1, 6.2.
Closesscylladb/scylladb#20851
* github.com:scylladb/scylladb:
Add conditions checking for get_read_executor
Avoid an extra call to block_for in db::filter_for_query.
Improve code readability in consistency_level.cc and storage_proxy.cc
tools: Add build_info header with functions providing build type information
tests: Add tests for alter table with RF=1 to RF=0
The purpose of this test that the cluster is able to boot up again after
a full cluster shutdown, thus exhibiting no issues when connecting to
raft group 0 that is larger than one.
This change implements the ability to await superuser creation in the
function ensure_superuser_is_created(). This means that Scylla will not
be serving CQL connections until the superuser is created.
Fixes#10481
This change reorganizes the way standard_role_manager startup is
handled: now the future returned by its start() function can be used to
determine when startup has finished. We use this future to ensure the
startup is finished prior to starting the CQL server.
Some clusters are created without auth, and auth is added later. The
first node to recognize that auth is needed must create the superuser.
Currently this is always on restart, but if we were to ever make it
LiveUpdate then it would not be on restart.
This suggests that we don't really need to wait during restart.
This is a preparatory commit, laying ground for implementation of a
start() function that waits for the superuser to be created. The default
implementation returns a ready future, which makes no change in the code
behavior.
The s3_storage backend driver has a method that generates object path
within the bucket. Depending on options alternative it picks one of two
formats:
- for string prefix, it uses it implicitly via sstable::filename() call
that calls storage->prefix() which, in turn, returns prefix value
- for registry-backed sstables, the /bucket/generation/component path is
generated
This patch bruses this place up. Similarly to previous patch, this
change also makes the selection based on the location alternative, not
on the sstable state. As well it's idempotent change, as S3 sstables
with 'upload' state only appear when restoring from object store, and in
this case the string location is in use.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After previous patchin, the way components lister is created for S3
storage options became quite hairy. This patch brushes things up to be
easier to read.
The only "functional" change here, is that selection between registry
lister and S3 lister is made based on options' location held
alternative, not on the sstable state value. That's in fact idempotent
change, the only caller that provides string location on options is the
"restore from object store" code that also sets state to be 'upload'.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is sort of continuation of the previous patch. The partition key in
the registry is now table_id, not string, and is better called "owner",
not "location". This patch is s/location/owner/ over specific places
that include field name in the schema, argument names in registry
maintenance classes and tests accessing the selected row fields by name.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Today, the system.sstables schema uses string as partition key. Callers,
in turn, use table's datadir value to reference entries in it. That's
wrong, S3-backed sstables don't have any local paths to work with. The
table's ID is better in this role.
This patch only changes the field type to be table_id and fixes the
callers to provide one. In particular, see init_table_storage() change
-- instead of generating a datadir string, it sets table.id() as the
options' location. Other fixed places are tests. Internally, this id
value is propagated via s3_storage::owner() method, that's fixed as
well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Enables debugging inside pytest subprocesses as well. It seems that pydev automatically attaches itself also to all python subprocesses. Since we used to call "pytest" wrapper it was deemed a different program, and we could not debug individual tests.
Closesscylladb/scylladb#21050
Previous patch put variant<string, table_id> as location of S3 options.
This patch makes the S3 sstables backend driver keep variant as sstable
location. As with the previous patch, driver only keeps variant, but
continues using its string alternative internally. This will be changed
later on.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Describing S3 storage for an sstables nowadays has two options -- via
sstables registry entry and by using the direct prefix string. The
former is used when putting a keyspace on S3. In this case each sstable
has the corresponding entry in the system.sstables table. The latter is
used by "restore from object storage" code. In that case, sstables don't
have entries in the registry, but are accessed by a specific S3 object
path.
This patch reflects this difference by making s3_options::location be
variant of string prefix and table_id owner. The owner needs more
explanation, here it is.
Today, the system.sstables schema defines partition key to be "string
location" and clustering key to be "UUID generation". The partition key
is table's datadir string, but it's wrong to use it this way. Next
patches will change the partition key to be table's ID (there's table_id
type for it), and before doing it storage options must be prepared to
carry it onboard. This patch does it, but the table_id alternative of
the location is still unused, the rest of the code keeps using the
string location to reference a row in the registry table. Next patches
will eventually make use of the table_id value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add the gossip state for broadcasting the nodes state_id.
Implemented the Group0 state broadcaster (based on the gossip) that will broadcast the state id of each node and check the minimal state id for the tombstone GC.
When there is a change in the tombstone GC minimal state id, the state broadcaster will update the tombstone GC time for the group0-managed tables.
The main component of the change is the newly added `group0_state_id_handler` that keeps track, broadcasts and receives the last group0 state_ids across all nodes and sets the tombstone GC deletion time accordingly:
* on each group0 change applied, the state_id handler broadcasts the state_id as a gossip state (only if the value has changed)
* the handler checks for the node state ids every refresh period (configurable, 1h by default)
* on every check, the handler figures out the lowest state_id (timeuuid), which is state_id that all of the nodes already have
* the timestamp of this minimum state_id is then used to set the tombstone GC deletion time
* the tombstone GC calculation then uses that deletion time to provide the GC time back to the callers, e.g. when doing the compaction
* (as the time for tombstone GC calculation has the 1s granularity we actually deduce 1s from the determined timestamp, because it can happen that there were some newer mutations received in the same second that were not distributed across the nodes yet)
This change introduces a new flag to the static schema descriptor (`is_group0_table`) that is being checked for this newly added mode in the tombstone GC. We also add a check (in non-release builds only) on every group0 modification that the table has this flag set.
The group0 tombstone GC handling is similar to the "repair" tombstone GC mode in a sense (that the tombstone GC time is determined according to a reconciliation action), however it is not explicitly visible to (nor editable by) the user. And also the tombstone GC calculation is much simpler than the "repair" mode calculation - for example, we always use the whole range (as opposed to the "repair" mode that can have specific repair times set for specific ranges).
We use the group0 configuration to determine the set of nodes (both current and previous in case of joint configuration) - we need to make sure that we account for all the group0 nodes (if any node didn't provide the state_id yet, the current check round will be skipped, i.e. no GC will be done until all known nodes provide their state_id timestamp value).
Also note that the group0 state_id handling works on all nodes independently, i.e. each node might have its own (possibly different) state depending on the gossip application state propagation. This is however not a problem, as some nodes might be behind, but they will catch up eventually, and this solution has the benefit of being distributed (as opposed to having a central point to handle the state, like for example the topology coordinator that has been considered in the early stages of the design).
Fixes: scylladb/scylla#15607
New feature, should not be backported.
Closesscylladb/scylladb#20394
* github.com:scylladb/scylladb:
raft: add the check for the group0 tables
raft: fast tombstone GC for group0-managed tables
tombstone_gc: refactor the repair map
raft: flag the group0-managed tables
gossip: broadcast the group0 state id
raft/test: add test for the group0 tombstone GC
treewide: code cleanup and refactoring
To have the filename(type, prefix) one, next patches will provide prefix
on their own, to avoid storage->prefix() call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This driver uses sstring _location as part of the lookup key in the
sstables registry. Next patches will need to change that and put more
checks on the registry access, so introduce a helper method beforehand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
During the investigation of scylladb/scylladb#20282, it was discovered that
implementations of speculating read executors have undefined behavior
when called with an incorrect number of read replicas. This PR
introduces two levels of condition checking:
- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in
get_endpoints_for_reading(): the map is considered incorrect the number of
read replica nodes is higher than replication factor. The check is
applied only when built in non release mode.
Please note: This PR does not fix the issue found in scylladb/scylladb#20282;
it only adds condition checks to prevent undefined behavior in cases of
inconsistent inputs.
Refs scylladb/scylladb#20625
A new header provides `constexpr` functions to retrieve build
type information: `get_build_type()`, `is_release_build()`,
and `is_debug_build()`. These functions are useful when adding
changes that should be enabled at compile time only for
specific build types.
Adding Vnodes and Tablets tests for alter keyspace operation that decreases replication factor
from 1 to 0 for one of two data centers. Tablet version fails due to issue described in
scylladb/scylladb#20625.
Test for scylladb/scylladb#20625
When generating readers for the set of sstables, the end size of this
vector is known in advance and its storage can be reserved.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21055
Continuation of the previous patch -- not commitlog-related endpoints
can use provided database reference, that was captured from main.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It registers itself in /storage_service function, but works with
commitlog, so should be located next to commitlog endpoints.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of other set_...()-s has the unset_...() scheduled right
afterwards, so here's one for set_server_commitlog().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The latter collects a bunch of endpoints including commitlog ones.
Extract it as snandalone call in main. It's currently not located next
to "commitlog server" as it should, because there's no standalone
commitlog service in main. It will be addressed as a followup together
with other endpoints that work with sharded<database>.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Implement change validation for raft topology_change command. For now
the only check is that the "pending replicas" contains at most one
entry. The check mirrors similar one in `process_one_row` function.
If not passed, this prevents system.tablets from being updated with the
mutation(s) that will not be loaded later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add validate_change() methods (well, a template and an overload) that
are called by prepare_command() and are supposed to validate the
proposed change before it hits persistent storage
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It will be needed later to get tablet_metadata from.
The dependency is "OK", shared_token_metadata is low-level sharded
service. Client already references db::system_keyspace, which in turn
references replica::database which, finally, references token_metadata
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schema module (everything in schema/) is supposed to be towards the
leafs in the ScyllaDB inter-module dependency graph. In other words, it
should not depend on many other modules. On the other hand, almost the
entire codebase depends on the schema module itself.
Currently there is a circular dependency between schema and
replica::database, as the latter is a required argument for
schema::describe(). This is bad, not just because of the dependency mess
it introduces, but also because now schema::describe() can only be used
by code which has a reference to the database handy.
This patch breaks this circular dependency, by introducing the
schema_describe_helper interface and providing an implementation for it
in database.hh.
There is another circular dependency: schema <-> replica::table. This is
not addressed by this patch.
Closesscylladb/scylladb#20893
Use clear_gently to avoid the following stalls.
```
~frozen_mutation_fragment at ././frozen_mutation.hh:268
std::destroy_at<frozen_mutation_fragment> at
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_construct.h:88
std::allocator_traits<std::allocator<std::_List_node<frozen_mutation_fragment>
> >::destroy<frozen_mutation_fragment> at
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/alloc_traits.h:537
std::__cxx11::_List_base<frozen_mutation_fragment,
std::allocator<frozen_mutation_fragment> >::_M_clear at
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/list.tcc:77
~_List_base at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_list.h:499
~partition_key_and_mutation_fragments at ././repair/repair.hh:298
~repair_row_on_wire_with_cmd at ././repair/repair.hh:335
operator() at ./repair/row_level.cc:1881
```
Fixes#21016
Performance improvement only. No backport.
Closesscylladb/scylladb#21017
* github.com:scylladb/scylladb:
repair: Fix stall in repair_get_row_diff_with_rpc_stream_process_op_slow_path
repair: Add clear_gently for partition_key_and_mutation_fragments
Keep a copy of the sstable uuid generation in a new
scylla_metadata sstable_identifier attribute.
If the SSTable happens to have a numerical generation
just create a new time-uuid and log a message about that.
Dump this new attribute in scylla sstable dump tool.
And add a unit test to verify that the written (and then
loaded) sstable identifier matches the sstable's generation.
The motivatrion for this change stems from backup
deduplication. In essence, an sstable may already have been
backed up in a previous snapshot, and we don't want to
abck it up again if it's already present on external storage.
Today this is based on rclone that compares files checksums,
but once scylla will backup the sstables using the native
object-storage stack (#19890), we would like to use the sstable
globally-unique identifier for deduplication. Although the
uuid-generation is encoded in the sstable path, the latter
may change, e.g. due to intra-node migration, so keep a copy
of the original unique identifier in scylla-metadata, and that
attribute would survive file-based or intra-node migrations.
Fixesscylladb/scylladb#20459
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#21002
When tablets are migrated with file-based streaming, we can have a situation where a tombstone is garbage collected before the data it shadows lands. For instance, if we have a tablet replica with 3 sstables:
1. sstable containing an expired tombstone
2. sstable with additional data
3. sstable containing data which is shadowed by the expired tombstone in sstable 1
If this tablet is migrated, and the sstables are streamed in the order listed above, the first two sstables can be compacted before the third sstable arrives. In that case, the expired tombstone will be garbage collected, and data in the third sstable will be resurrected after it arrives to the pending replica.
This change fixes this problem by disabling tombstone garbage collection for pending replicas.
This fixes a problem in Enterprise, but the change is in OSS in order to have as few differences between OSS and Enterprise and to have a common infrastructure for disabling tombstone GC on pending replicas.
This change has to be backported to all active versions: 6.0, 6.1 and 6.2, as well as Enterprise 2024.2
Closesscylladb/scylladb#20788
* github.com:scylladb/scylladb:
test: test tombstone GC disabled on pending replica
tablet_storage_group_manager: update tombstone_gc_enabled in compaction group
database::table: add tombstone_gc_enabled(locator::tablet_id)
This PR builds upon the PR for checksum validation (#20207) to further enhance scrub's corruption detection capabilities by validating digests as well. The digest (full checksum) is the checksum over the entire data, as opposed to per-chunk checksums which apply to individual chunks. Until now, digests were not examined on any code paths. This PR integrates digest checking into the compressed/checksummed data sources as an optional feature and enables it only through the validation path of the sstable layer (`sstable::validate()`). The validation path is used by the following tools:
* scrub in validate mode
* `sstable validate`
All other reads, including normal user reads, are unaffected by this change.
The PR consists of:
* Extensions to the compressed and checksummed data sources to support digest checking. The data sources receive the expected digest as a parameter and calculate the actual digest incrementally across multiple get() calls. The check happens on the get() call that reaches EOF and results to an exception if the digest is invalid. A digest check requires reading the whole file range. Therefore, a partial read or skip() is treated as an internal error.
* A new shareable digest component loaded on demand by the validation code. No lifecycle management.
* Grouping of old scrub/validate tests for compressed and uncompressed SSTables to reduce code duplication.
* scrub/validate tests for SSTables with valid checksums but invalid digests, and SSTables with no digests at all.
* scrub/validate tests with 3.x Cassandra SSTables to ensure compatibility.
Refs #19058.
New feature, no backport is needed.
Closesscylladb/scylladb#20720
* github.com:scylladb/scylladb:
test: Test scrub/validate with SSTables from Cassandra
compaction: Make quarantine optional for perform_sstable_scrub()
test: Make random schema optional in scrub_test_framework
test: Add tests for invalid digests
test: Merge scrub/validate tests for compressed and uncompressed cases
sstables: Verify digests on validation path
sstables: Check if digest component exists
sstables: Add digest in the SSTable components
sstables: Add digest check in compressed data source
sstables: Add digest check in checksummed data source
The test/cql-ptest/run-cassandra prefers to use Java 11 if installed on
the system because this is the only version of Java that all modern
versions of Cassandra run on (Cassandra 3 and 4 can run on Java 8 and 11,
Cassandra 5 can run on Java 11 and 17).
However, in our search order we tried the "java" in the user's path
first, before trying Java 11. This means that if the user for some
reason had the ancient Java 8 (which is now a decade old) as his
default "java" got that, instead of Java 11, and couldn't run Cassandra 5.
While at it, update the comments to reflect the new reality that
Cassandra 5 needs Java 17 or 11 - *not* 11 or 8 as the older Cassandra.
We should eventually change the code logic as well (searching for
versions that depend on the Cassandra version - not always Java 8 and
11), but let's do it later. This patch already fixes a real bug for
developers that did install Java 11 but their default "java" pointed to
Java 8.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21001
It was not possible to link to configuration parameters groups in docs/reference/configuration-parameters.rst if they contained a space.
Closesscylladb/scylladb#21018
The process_one_row() evaluates pending_replica by subtracting replicas
from new_replicas. There's a convenience helper for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21019
To avoid depending on two similar libraries (boost ranges and std \<ranges), replace
uses of the former with the latter. This series tackles the utils/ directory.
Code cleanup, no backport.
Closesscylladb/scylladb#20997
* github.com:scylladb/scylladb:
utils: logalloc: replace boost with std
utils: lsa: chunked_managed_vector: replace boost with std
utils: config_file: replace boost with std
utils: loading_cache: replace boost with std
utils: fragment_range: replace boost with std
utils: error_injector: replace boost with std
utils: crc: replace boost for_each with built-in range for
utils: class_registrator: replace boost with std
utils: chunked_vector: replace boost with std
utils: observable: replace boost with std
There's a long-pending issue in distributed loader. When it populates sstables on boot it loops over table.config.all_datadirs, but ignores the loop cursor (the datadir itslef), instead loading sstables from table.config.dir, which is 0th element of all_datadirs. There's a test for that, but it's also broken. Effectively collection happens from table.config.dir several times. For local sstables that's just wasted work and potentially lost sstables (but nobody seems to configure more than 1 datadir anyway). For S3 sstables it's also wasted work and incorrectness.
The fix is for both -- populator and test. The former is to use all_datadirs to construct sstable_directory. To make it happen, creation of sstable_directory now depends on the storage options, the loop is moved into the branch that creates sstable_directory for local storage type. The test fix is to make sure that some sstables in non-default datadir before running population code.
Closesscylladb/scylladb#20819
* github.com:scylladb/scylladb:
test: Fix test_multiple_data_dirs
distributed_loader: Indentation fix after previous patch
distributed_loader: Use correct datadir to collect local sstable
distributed_loader: Move all-datadirs loop to local storage collecting
distributed_loader: Collect table subdirs based on its storage options
distributed_loader: Indentation fix after previous patch
distributed_loader: Squash loop of collect_subdir into one method
distributed_loader: Convert map of directories into a vector
distributed_loader: Make start_subdir() method work with directory
distributed_loader: Drop local reference variable
distributed_loader: Split start_subdir()
distributed_loader: Remove allow-offstrategy argument
distributed_loader: Make populate() method work with directory
distributed_loader: Remove check for sstable_directory presense
distributed_loader: Out-line table_populator() methods
distributed_loader: Print storage options, not datadir
distributed_loader: Print prepared message
sstable_directory: Add sstable_state argument ot one of constructors
sstable_directory: Add state() method
can_admit_read() returns reason::memory_resources when the permit is queued due
to lack of count resources, and it returns reason::count_resources when the
permit is queued due to lack of memory resources. It's supposed to be the other
way around.
This bug is causing the two counts to be swapped in the stat dumps printed to
the logs when semaphores time out.
Closesscylladb/scylladb#20714
During shutdown, the compaction_manager starts stopping ongoing
compaction tasks through `really_do_stop()` method as soon as it
receives a signal from the abort source. Later, when the database object
shuts down, it calls `compaction_manager::drain` to ensure that all
compaction tasks have stopped. However, `compaction_manager::drain` is
currently implemented in such a way that, during shutdown, it
effectively becomes a no-op because the compaction_manager has already
initiated the stopping of tasks. As a result the caller assumes that all
the compaction tasks have stopped and proceeds to close all the tables.
This can lead to race conditions where table closures overlap with
compaction tasks that are still running, resulting in exceptions like :
```
exception during mutation write to 127.0.0.1:
utils::internal::nested_exception<std::runtime_error> (Could not write
mutation system:compaction_history
(pk{0010b70d31705e0411efb2edf6467f094c8b}) to commitlog):
seastar::gate_closed_exception (gate closed)
```
This commit fixes the issue by updating `compaction_manager::drain` to
invoke `stop_ongoing_compactions` even during shutdown to ensure that it
waits for the ongoing compaction tasks to complete. The
`stop_ongoing_compactions` method will also send a stop request to these
tasks before waiting, but the request will be ignored by the tasks as
they would have already received one earlier from `really_do_stop()`.
Fixes#20197
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#20715
fixes#20517
Adds `aws_error` which possibly can contain errors from the S3 response body. Adds to the multipart upload completion a check for possible error and issues a retry if the error is retryable
Closesscylladb/scylladb#20518
* github.com:scylladb/scylladb:
test: add complete_multipart_upload completion tests
code: s3 client error handling
code: add response parsing and error handling to the complete_multipart_upload
code: Introduce AWS errors parsing
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.
Fixes: #20240Closesscylladb/scylladb#21007
As part of the effort to standardize on a single range library, convert the unconst helper
and its only user to \<ranges>.
The only user, mutation_partitions, happens to use intrusive_btree::iterator as the payload. That
iterator wasn't fully conform to iterator requirements, so it's fixed in a preliminary patch.
Code cleanup; no backport.
Closesscylladb/scylladb#20986
* github.com:scylladb/scylladb:
utils/unconst, mutation_partition: switch to ranges
utils: intrusive_btree: improve conformity with iterator requirements
Use clear_gently to avoid the following stalls.
```
~frozen_mutation_fragment at ././frozen_mutation.hh:268
std::destroy_at<frozen_mutation_fragment> at
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_construct.h:88
std::allocator_traits<std::allocator<std::_List_node<frozen_mutation_fragment>
> >::destroy<frozen_mutation_fragment> at
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/alloc_traits.h:537
std::__cxx11::_List_base<frozen_mutation_fragment,
std::allocator<frozen_mutation_fragment> >::_M_clear at
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/list.tcc:77
~_List_base at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_list.h:499
~partition_key_and_mutation_fragments at ././repair/repair.hh:298
~repair_row_on_wire_with_cmd at ././repair/repair.hh:335
operator() at ./repair/row_level.cc:1881
```
Fixes#21016
Set the tombstone GC time for group0-managed tables to the minimal state
id of the group0 nodes.
The check is being done based on a timer, iterating through each node
(according to the group0 topology configuration) and taking the minimum
across all nodes.
This miminum timestamp is then be used to set the tombstone GC time
for the tombstone GC of all the group0-managed tables.
Fixes: scylladb/scylla#15607
Move the repair_map definition to the tombstone_gc file where it is
mostly being used.
Refactor and add the accessors and setters for the group0 tombstone GC
time.
Implemented the group0 state_id handler (based on the gossip) that will
broadcast the group0 state id of each node.
This will be used to set the tombstone GC time for the group0 tables.
Scylla doesn't allow for the types of arguments or the return type of a UDF
to be frozen. As a result, before these changes, create statements
produced to restore UDFs as part of `DESCRIBE` statements could not
be executed.
Fixesscylladb/scylladb#20256
Backport: necessary as the restore process may not work correctly without these changes. The affected versions span from 5.2 to the current master, but we only want to apply the fix to the live versions, so 6.0, 6.1, and 6.2.
Closesscylladb/scylladb#20816
* github.com:scylladb/scylladb:
cql3/functions/user_function: Print arguments and return type without frozen
cql3/functions/user_function: Use fmt to format create statement
This patch series fixes a couple of bugs around validating if RF is not changed by too much when performing ALTER tablets KS.
RF cannot change by more than 1 in total, because tablets load balancer cannot handle more work at once.
Fixes: #20039
Should be backported to 6.0 & 6.1 (wherever tablets feature is present), as this bug may break the cluster.
Closesscylladb/scylladb#20208
* github.com:scylladb/scylladb:
cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
cql: join new and old KS options in ALTER tablets KS
cql: fix validation of ALTERing RFs in tablets KS
cql: harden `alter_keyspace_statement.cc::validate_rf_difference`
cql: validate RF change for new DCs in ALTER tablets KS
cql: extend test_alter_tablet_keyspace_rf
cql: refactor test_tablets::test_alter_tablet_keyspace
cql: remove unused helper function from test_tablets
Group0 server is often used in asynchronous context, but we do not wait
for them to complete before destroying the server. We already have
shutdown gate for it, so lets use it in those asynch functions.
Also make sure to signal group0 abort source if initialization fails.
Fixesscylladb/scylladb#20701
Backport to 6.2 since it contains af83c5e53e and it made the race easier to hit, so tests became flaky.
Closesscylladb/scylladb#20891
* github.com:scylladb/scylladb:
group: hold group0 shutdown gate during async operations
group0: Stop group0 if node initialization fails
Currently, `estimated_pending_compactions` uses a precalculated value calculated by `update_estimated_compaction_by_tasks`, which, in turn, is called by `get_compaction_candidates`. That means that, if `estimated_pending_compactions` is called, e.g. right after major compaction, it will return an outdated value that was calculated prior to major compaction, and so, it is no longer relevant.
Instead, just recalculate the value in `estimated_pending_compactions` and drop `update_estimated_compaction_by_tasks`.
* Enhancement, no backport required
Closesscylladb/scylladb#20892
* github.com:scylladb/scylladb:
test: cql-pytest: test_compaction: add test_compactionstats_after_major_compaction
test/cql-pytest: rename test_compaction{_tombstone_gc,}
time_window_compaction_strategy: estimated_pending_compactions: reestimate compactions rather than using cached value
storage_proxy::cancellable_write_handlers_list::update_live_iterators
assumes that iterators in _live_iterators can be dereferenced, but
the code does not make any attempt to make sure this is the case. The
iterator can be the end iterator which cannot be dereferenced.
The patch makes sure that there is no end iterator in _live_iterators.
Fixesscylladb/scylladb#20874Closesscylladb/scylladb#20977
Unfortunately, the replacement for boost::range::join(),
std::views::concat(), is in C++26 (and not implemented in libstdc++ 14).
We use array/transform/join to simulate it.
Commit efd65aebb2 ("build: cmake: add check-header target", 2023-11-13)
introduced three typos:
- In "cmake/check_headers.cmake", it checked whether the
"parsed_args_GLOB_RECURSE" argument was defined, but then it referenced
the same under the wrong name "parsed_args_RECURSIVE".
- The above error masked two further typos; namely the duplicate use of
"api" and "streaming" each, as targets. With "parsed_args_GLOB_RECURSE"
above fixed, CMake now reports these conflicting arguments (target
names). They should have been "node_ops" and "sstables", respectively.
Correct the typos.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Closesscylladb/scylladb#20992
Scylla doesn't allow for the types of arguments or the return type
to be frozen. As a result, before these changes, create statements
produced to restore UDFs as part of `DESCRIBE` statements could not
be executed.
We fix that and add a reproducer test and another one to verify that
the implementation is correct.
Before this patch, the "/localnodes" HTTP request to the Alternator server
lists all the live nodes of the current DC. This patch adds two optional
parameters to this query:
dc: allows to list the live nodes of a specific named DC instead of the
current DC of the server.
rack: allows to restrict the results to just the nodes belonging to a
specific named rack.
For both options, if no live node exists in the given dc or rack (in
particular, if such a dc or rack doesn't even exist), an empty list is
returned - it's not an error.
The default, if dc or rack is not specified - remains exactly as it is
today - look at the current DC (the one of the node being request), and
do not restrict the list to any specific rack.
We expect the new options that we added here to be useful for two use cases:
1. A client that knows of *some* Scylla node (belonging to an unknown DC),
but wants to list the nodes in *its* DC, which it knows by name.
2. A client in a multi-rack DC (e.g., multi-AZ region in AWS) that wants
to send requests to nodes in its own rack (which it knows by name),
to avoid cross-rack networking costs.
Note that in both cases, this requires clients to know the names of DCs
and AZs via some out-of-band means. The client can also get a list of DCs
and racks using the system.local system table, as the tests included in
this patch demonstrate.
This patch includes two set of tests for these new options: One in the the
single-node test/alternator framework that has a single dc and rack but
can still check the case of an unknown dc or rack (in which case an empty
list is returned). The second test is in the topology framework, and runs
an 8-node cluster with two DCs, two racks, and two nodes in each, and
checks all the combinations of "/localnodes" requests with and without
dc and rack options. This test also resolves a longstanding TODO that
asked for such a multi-DC test for "/localnodes" to be written.
Fixes#12147
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20915
The helper makes sstables from env directly. Callers may not create the
factor after that. Less code the better.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20983
since we only need the full definition of boost::regex in the .cc
file, where we
- define the constructor and destructor
- and actually use the regex.
there is no need to include boost/regex.hpp in the header, in order
to keep the preprocessed header smaller. let's use a header only
contains forward declarations in header, and include the full
definition in the .cc file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Tablets load balancer is unable to process more than a single pending
replica, thus ALTER tablets KS cannot accept an ALTER statement which
would result in creating 2+ pending replicas, hence it has to validate
if the sum of absoulte differences of RFs specified in the statement is
not greter than 1.
A bug has been discovered while trying to ALTER tablets KS and
specifying only 1 out of 2 DCs - the not specified DC's RF has been
zeroed. This is because ALTER tablets KS updated the KS only with the
RF-per-DC mapping specified in the ALTER tablets KS statement, so if a
DC was ommitted, it was assigned a value of RF=0.
This commit fixes that plus additionally passes all the KS options, not
only the replication options, to the topology coordinator, where the KS
update is performed.
`initial_tablets` is a special case, which requires a special handling
in the source code, as we cannot simply update old initial_tablet's
settings with the new ones, because if only ` and TABLETS = {'enabled':
true}` is specified in the ALTER tablets KS statement, we should not zero the `initial_tablets`, but
rather keep the old value - this is tested by the
`test_alter_preserves_tablets_if_initial_tablets_skipped` testcase.
Other than that, the above mentioned testcase started to fail with
these changes, and it appeared to be an issue with the test not waiting
until ALTER is completed, and thus reading the old value, hence the
test's body has been modified to wait for ALTER to complete before
performing validation.
unconst is a small help that converts a const iterator to a non-const
iterator with the help of the container. Currently it is using the
boost iterator/range libraries.
Convert it to <ranges> as part of an effort to standardize on a single
range library. Its only user in mutation_partition is converted as well.
Due to more iteroperability problems between <range> and boost, some
calls to boost::adaptors::reversed have to be converted as well.
The <ranges> library checks that an iterator's operator++() returns
a reference to the same type. intrusive_btree's iterator do not; instead
they return some base type and rely on implicit conversion to the real
iterator type. This causes interoperatibility problems with <range>.
Fix by using the CRTP pattern to inform iterator_base about what
type we really are, and cast to it. Enforce it with static_assert.
Note we can't static_assert in class scope since it is checked too
early and fails. Checking in function scope delays the check.
The validation has been corrected with:
1. Checking if a DC specified in ALTER exists.
2. Removing `REPLICATION_STRATEGY_CLASS_KEY` key from a map of RFs that
needs their RFs to be validated.
This function assumed that strings passed as arguments will be of
integer types, but that wasn't the case, and we missed that because this
function didn't have any validation, so this change adds proper
validation and error logging.
Arguments passed to this function were forwarded from a call to
`ks_prop_defs::get_replication_options`, which, among rf-per-dc mapping, returns also
`class:replication_strategy` pair. Second pair's member has been casted
into an `int` type and somehow the code was still running fine, but only
extra testing added later discovered a bug in here.
ALTER tablets KS validated if RF is not changed by more than 1 for DCs
that already had replicas, but not for DCs that didn't have them yet, so
specifying an RF jump from 0 to 2 was possible when listing a new DC in
ALTER tablets KS statement, which violated internal invariants of
tablets load balancer.
This PR fixes that bug and adds a multi-dc testcases to check if adding
replicas to a new DC and removing replicas from a DC is honoring the RF
change constraints.
Refs: #20039
1. Renamed the testcase to emphasize that it only focuses on testing
changing RF - there are other tests that test ALTER tablets KS
in general.
2. Fixed whitespaces according to PEP8
All current unit tests for scrub in validate mode generate random
SSTables on the fly.
Add some more tests with frozen Cassandra SSTables from the source tree
to verify compatibility with Cassandra. Use some of the existing 3.x
Cassandra SSTables to test the valid case, and use the same schema to
generate some corrupted SSTables for the invalid case. Overall, the new
tests cover the following scenarios:
* valid compressed/uncompressed
* compressed/uncompressed with invalid checksums
* compressed/uncompressed with invalid digest
For the compressed SSTable with invalid checksums, a small chunk length
was used (4KiB) to have more chunks with less disk space. For
uncompressed SSTables the chunk length is not configurable.
Finally, since the SSTables live in the source tree, the quarantine
mechanism was disabled.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Allow `perform_sstable_scrub()` to disable quarantine for invalid
SSTables detected by scrub in validate mode. This is already supported
by the lower-level function `scrub_sstables_validate_mode()` via the
flag `quarantine_sstables` and is being used by sstable-scrub.
Propagate the flag up to `perform_sstable_scrub()`. This will allow to
test scrub/validate against read-only SSTables from the source tree.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The scrub_test_framework, which is the foundation for all scrub-related
tests, always generates a random schema upon initialization and makes it
available to the user. This is useful for running tests with ephemeral
SSTables, but is redundant when the creation of the SSTable predates the
test (e.g., it lives in the source tree).
Turn scrub_test_framework into a template with a boolean parameter to
optionally switch off the random schema generation. Also, add an
overload for run() to support passing a ready-to-use SSTable instead of
mutation fragments.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In a previous patch we extended the validation path of the SSTable layer
to validate the digests along with the checksums.
Add two tests for compressed and uncompressed SSTables to test the
validation API against SSTables with valid checksums but corrupted
digests.
Add two more tests to ensure that the absence of digest does not affect
checksum validation.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Currently, every scrub/validate test is duplicated to cover both
compressed and uncompressed SSTables. However, except for the
compression type, the tests are identical. This leads to some code
bloat.
Introduce common functions parameterized by the compression type to
reduce code duplication. Also, group together the compressed and
uncompressed variants into one compression-agnostic test.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Extend the validation path to perform digest checking on all SSTables.
This is achieved by loading the digest component on demand and passing
it to the underlying data sources only during validation. The data
sources for compressed and uncompressed SSTables were modified in
previous patches to support digest checking.
Consider digest checking as part of the integrity checking mechanism
(i.e., requires `integrity_check::yes`) to ensure it remains disabled
for all reads happening outside of the validation path (i.e.,
`sstable::validate()`). This practically means that digest checking is
enabled only for:
* scrub in validate mode
* sstable validate
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
There are two issues in it. First, listing the registry with a consumer callback passes wrong argument to the consumer. Second, the primary key of the registry is wrong. Both issues don't show up, because existing tests that use mock don't read from it, only write. Tests that read from registry are python tests that start scylla and thus use real registry.
Closesscylladb/scylladb#20946
* github.com:scylladb/scylladb:
test: Use corrcet key in sstables registry mock
test: Pass entry status to mock registry consumer
This patch adds reproducer tests (still failing) for issue #20755, which
is about missing validation of materialized view names:
1. Unlike table and keyspace names which are limited to 48 characters,
we forgot to limit view name length, and an excessively long name can
cause Scylla to shut down :-(
2. Unlike table and keyspace names which only allow alphanumeric
characters, view names are missing this check and can include any
characters.
3. Luckily, even though we are missing the alphanumeric check, we at
least don't allow "/" in view names (if we allowed them, it could
allow users to write in any directory in the filesystem!). But when
this happens, we get an internal error instead of the expected
errors.
The first test also fails on Cassandra (it doesn't crash it, but leaves
the table in a strange state), but the other two pass.
Refs #20755
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20761
Some adjustments to the `.clang-format` options to better match the current code:
* don't sort the include headers: causes large diffs especially in files with a lot of includes, and the `#include` ordering is not prescribed by the Seastar coding style
* binpack the arguments in function declarations and calls: allow binpacking (as opposed to forcing each parameter on a separate line if they don't fit into the line length)
* indented parameter continuation (as opposed to aligning to the open parenthesis) - aligning to the open parenthesis causes alignment issues especially with lambdas
Fixes: scylladb/scylladb#20951
No backport: Not a product issue, just applies to master.
Closesscylladb/scylladb#20968
* github.com:scylladb/scylladb:
clang-format: argument and function packing
clang-format: don't sort the include headers
The one was broken from the very beginning. It only checked that after
creating a table, its directory is created in all datadirs. But it
didn't check that after restart populating happens from the all. That's
because all directories by 0th were always empty, so not-populating from
them didn't skip any data.
Fix it by moving all sstables from datadirs[0] to datadirs[1] before
restart. With that update not-populating data from datadirs[1] will
be noticed instantly. Fortunately, previous patches fixed that, so the
test still passes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Current code uses datadir it gets from table itself, which is the 0th
element in the all-datadirs config. So populating local sstables happens
several times from the same directory. Fix it by starting sstable
directory with correct datadir -- the one obtained from the all-datadirs
loop.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It now happens in the outer loop, but it's not correct for S3 storage,
which is thus asked to collect its data twice. Also it's broken for
local storage as well, because the datadir argument is ignored. Next
patch will fix it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Collecting sstables for local storage and for S3 storage differs. First,
the populator collects sstables for each datadir configured in
scylla.yaml, but S3 storage doesn't care, so it's effectively asked to
collect the same data twice. Second, S3 collector code uses
sstable_directory simply because that class is used by reshape and
reshard code, but in fact collecting of S3 sstable can be made much
simpler (but that's for later).
Having said that, split preparation of sstables population for local
and S3 storage types.
Indentation is deliberately left broken for local storage collecting
mathod. That's because otherwise next patch will need move it back
anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Knowledge of sstable state is no longer needed in the table_populator
start/stop methods, so the map<state, directory> can be converted into
vector<directory>.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to populate_subdir() one, it also accepts state and gets
directory out of it. Patch is the same way -- caller now passes it the
reference to directory and doesn't care about the state (in fact, the
start_subdir() doesn't care of the state either).
While at it -- rename the method to reflect what it does.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to make populate_subdir() be self-contained in a way it uses
passed sstable_directory and make caller not care about the state.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The populate_subdir() accepts sstable_state argument and picks the
corresponding sstable_directory object from the map. Patch it so that
caller passes it the sstable_directory reference. For now it makes
things more complicated, but next patches will simplify it back.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In the old days the set of sstable_directory-s used by populator could
skip some of them. Now they are all present and the checks is always
false.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Tables not necessarily have data in a directory, so it's more correct to
show storage options in logs, not some directory path.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When population throws, the catch block prepares a message to re-throw
another exception and prints the same message into logs. Presumably the
intent was to print the prepared message as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's one constructor that became unused after 787ea4b1. Modify it
with the 'state' argument so that it could be used later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
simpler this way. `sst` does not help with the readability or
performance, but let's drop it. simpler this way. also, remove the
unused parameter.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20961
these unused includes are identified by clang-include-cleaner.
after auditing the source files, all of the reports have been
confirmed.
please note, since we have `using seastar::shared_ptr` in
`seastarx.h`, this renders `#include <seastar/core/shared_ptr.hh>`
unnecessary if we don't need the full definition of `seastar::shared_ptr`.
so, in this change, all the unused includes are removed.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#20963
* github.com:scylladb/scylladb:
.github: add db to iwyu's CLEANER_DIR
db: remove unused includes
in 787ea4b1, we introduced `_prefix` and `_sstables` member variables
to `sstables_loader::download_task_impl`, replacing the functionality
of `_snapshot_name`. However, we overlooked removing the now-obsolete
`_snapshot_name` variable.
this commit removes the unused `_snapshot_name` member variable to
improve code cleanliness and prevent potential confusion.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20969
Test that compactionstats are empty, i.e.
there are no required compactions following
major compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, `estimated_pending_compactions` uses a precalculated value
calculated by `update_estimated_compaction_by_tasks`, which, in turn,
is called by `get_compaction_candidates`. That means that, if
`estimated_pending_compactions` is called, e.g. right after
major compaction, it will return an outdated value that was
calculated prior to major compaction, and so, it is no longer
relevant.
Instead, just recalculate the value in `estimated_pending_compactions`
and drop `update_estimated_compaction_by_tasks`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Commit af83c5e53e moved aborting of group0 into the storage service
drain function. But it is not called if node fails during initialization
(if it failed to join cluster for instance). So lets abort on both
paths (but only once).
when compiling with clang-19 and the standard library from GCC-14.2,
we have:
```
/usr/bin/cmake -E __run_co_compile --tidy="clang-tidy;--checks=-*,bugprone-use-after-move;--extra-arg-before=--driver-mode=g++" --source=/__w/scylladb/scylladb/utils/bloom_filter.cc -- /usr/bin/clang++ -DBOOST_REGEX_DYN_LINK -DBOOST_REGEX_NO_LIB -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -I/__w/scylladb/scylladb -I/__w/scylladb/scylladb/seastar/include -I/__w/scylladb/scylladb/build/seastar/gen/include -I/__w/scylladb/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/__w/scylladb/scylladb/build=. -march=wes
Error: /__w/scylladb/scylladb/utils/bloom_filter.cc:81:1: error: unknown type name 'filter_ptr' [clang-diagnostic-error]
81 | filter_ptr create_filter(int hash, large_bitset&& bitset, filter_format format) {
| ^
Error: /__w/scylladb/scylladb/utils/bloom_filter.cc:82:12: error: no viable conversion from returned value of type '__detail::__unique_ptr_t<murmur3_bloom_filter>' (aka 'unique_ptr<utils::filter::murmur3_bloom_filter>') to function return type 'int' [clang-diagnostic-error]
82 | return std::make_unique<murmur3_bloom_filter>(hash, std::move(bitset), format);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Error: /__w/scylladb/scylladb/utils/bloom_filter.cc:85:1: error: unknown type name 'filter_ptr' [clang-diagnostic-error]
85 | filter_ptr create_filter(int hash, int64_t num_elements, int buckets_per, filter_format format) {
| ^
Error: /__w/scylladb/scylladb/utils/bloom_filter.cc:86:12: error: no viable conversion from returned value of type '__detail::__unique_ptr_t<murmur3_bloom_filter>' (aka 'unique_ptr<utils::filter::murmur3_bloom_filter>') to function return type 'int' [clang-diagnostic-error]
86 | return std::make_unique<murmur3_bloom_filter>(hash, large_bitset(get_bitset_size(num_elements, buckets_per)), format);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Error: /__w/scylladb/scylladb/utils/bloom_filter.hh:93:1: error: unknown type name 'filter_ptr' [clang-diagnostic-error]
93 | filter_ptr create_filter(int hash, large_bitset&& bitset, filter_format format);
| ^
Error: /__w/scylladb/scylladb/utils/bloom_filter.hh:94:1: error: unknown type name 'filter_ptr' [clang-diagnostic-error]
94 | filter_ptr create_filter(int hash, int64_t num_elements, int buckets_per, filter_format format);
| ^
Error: /__w/scylladb/scylladb/utils/i_filter.hh:17:25: error: no template named 'unique_ptr' in namespace 'std' [clang-diagnostic-error]
17 | using filter_ptr = std::unique_ptr<i_filter>;
| ~~~~~^
Error: /__w/scylladb/scylladb/utils/i_filter.hh:54:12: error: unknown type name 'filter_ptr' [clang-diagnostic-error]
54 | static filter_ptr get_filter(int64_t num_elements, double max_false_pos_prob, filter_format format);
| ^
4 warnings and 8 errors generated.
```
apparently, the definition of `std::unique_ptr` is missing where it is
used. so let's include `<memory>`, so that `i_filter.hh` is more
self-contained.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20971
Commit aa1270a00c changed most uses
of `assert` in the codebase to `SCYLLA_ASSERT`.
But the comment fixed in this patch is talking specifically about
`assert`, and shouldn't have been changed. It doesn't make sense
after the change.
Closesscylladb/scylladb#20967
Currently, `cached_file::stream` (currently used only by index_reader,
to read index pages), works as follows.
Assume that the caller requested a read of the range [pos, pos + size).
Then:
- If the first page of the requested range is uncached,
the entire [pos, pos + size) range is read from disk (even if some
later pieces of it are cached), the resulting pages are added to the cache,
and the read completes (most likely) from the cached pages.
- If the first page of the read is cached, then the rest of the read
is handled page-by-page, in a sequential loop, serving each page
either from cache (if present) or from disk.
For example, assume that pages 0, 1, 2, 3, 4 are requested.
If exactly pages 1, 2 are cached, then `stream` will read the entire [0, 4] range
from disk and insert the missing 0, 3, 4, and then it will continue serving the
read from cache.
If exactly pages 0 and 3 are cached, then it will serve 0 from cache,
then it will read 1 from disk and insert it into cache,
then it will read 2 from disk and insert it into cache,
then it will serve 3 from cache,
then it will read 4 from disk and insert it into cache.
If exactly the first page is cached, a 128 kiB read turns
into 31 I/O sequential read ops.
This is weird, and doesn't look intended. In one case, we are reading even pages
we already have, just to avoid fragmenting the read, and in the other case
we are reading pages one-by-one (sequentially!) even if they are neighbours.
I'm not sure if cached_file should minimize IOPS or byte throughput,
but the current state is surely suboptimal. Even if its read strategy
is somehow optimal, it should still at least coalesce contiguous reads
and perform the non-contiguous reads in parallel.
This patch leans into minimizing IOPS. After the patch, we serve
as many front pages from the cache as we can, but when we see
an uncached page, we read the entire remainder of the read from disk.
As if we trimmed the read request by the longest cached prefix,
and then performed the rest using the logic from before the patch.
For example, if exactly pages 0 and 3 are cached,
then we serve 0 from cache,
then we read [1, 4] from disk and insert everything into cache.
For partially-cached files, this will result in more bytes read
from disk, but less IOPS. This might be a bad thing. But if so,
then we should lean the other way in a more explicit and efficient
way than we currently do.
Closesscylladb/scylladb#20935
Changes to better match the Seastar code style and the current codebase.
Allow parameter binpacking and continuation indenting.
Refs: scylladb/scylladb#20951
Sorting the include headers causes reordering of all headers and thus
large diffs, especially in the files that include a lot of headers that
have not been sorted before. This makes it harder to review the changes
and to understand the history of the file.
The Seastar code style doesn't prescribe any include headers ordering.
Refs: scylladb/scylladb#20951
these unused includes are identified by clang-include-cleaner.
after auditing the source files, all of the reports have been
confirmed.
please note, since we have `using seastar::shared_ptr` in
`seastarx.h`, this renders `#include <seastar/core/shared_ptr.hh>`
unnecessary if we don't need the full definition of `seastar::shared_ptr`.
so, in this change, all the unused includes are removed. but there are
some headers which are actually used, while still being identified by
this tool. these includes are marked with "IWYU pragma: keep".
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The restore-from-s3 task uses load-and-stream internally which, in turn, unlinks loaded sstables on success. That's not what user expects when it restores from backup, objects should remain in bucket afterwards.
Closesscylladb/scylladb#20947
* github.com:scylladb/scylladb:
test: Add check that restored-from objects are not removed
sstables_loader: Dont unlink sstables when restoring from S3
sstables_loader: Make primary_replica_only bool_class RAII field
before this change, we enumerate the sstables tracked by the
system.sstables table, and restore them when serving
requests to "storage_service/restore" API. this works fine with
"storage_service/backup" API. but this "restore" API cannot be
used as a drop-in replacement of the rclone based API currently
used by scylla-manager.
in order to fill the gap, in this change:
* add the "prefix" parameter for specifying the shared prefix of
sstables
* add the "sstables" parameter for specifying the list of TOC
components of sstables
* remove the "snapshot" parameter, as we don't encode the prefix
on scylla's end anymore.
* make the "table" parameter mandatory.
Fixes https://github.com/scylladb/scylladb/issues/20461
----
this change is a part of the efforts to bring the native backup/restore to scylla, no need to backprt.
Closesscylladb/scylladb#20685
* github.com:scylladb/scylladb:
treewide: accept list of sstables in "restore" API
sstable: pass get_storage_option to sstable_directory::load_sstable()
test/nodetool: add body parameter to `expected_request`
tools/scylla-nodetool: enable nodetool to write HTTP body
Those setting and getting bandiwdth need compaction manager to work with
and thus should sit next to other enpoints working with it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to other .cc files, compaction manager should have its
endpoints unset. For now, no batch unsetting exists, so need to do it
one-by-one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The register_api() helper does exatly what's needed here -- registers
function and calls a method to set routes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes https://github.com/scylladb/scylladb/issues/20044.
Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed.
Closesscylladb/scylladb#20939
* github.com:scylladb/scylladb:
replica: Fix tombstone GC during tablet split preparation
service: Improve error handling for split
Extend `read_digest()` to first check if the digest component exists
before attempting to load it from disk.
Make `validate_checksums()` throw an error if the component does not
exist to preserve its current behavior.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
SSTables store their digest in a Digest file. Add this in the list of
SSTable components. In a follow-up patch we will use this component to
enable digest checking in the validation path.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Following the addition of digest check in the checksummed data source,
add the same feature to the compressed data source as well. This ensures
consistent behavior across any type of SSTable.
This is added as an optional feature so that we can preserve the current
behavior, that is verify only the per-chunk checksums during normal user
reads. To ensure zero cost at runtime when disabled, we introduce the
on/off switch as a template parameter.
The digest calculation for compressed SSTables depends on the SSTable
format, hence the new template argument for the checksum mode. This is
consistent with the compressed data sink.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The checksummed data source verifies the checksum of each chunk in the
data files of uncompressed SSTables. This is being leveraged by scrub
in validation mode.
Extend the data source to check the digest (full checksum) as well.
Unlike checksums, this is added as an optional feature so that SSTables
without a digest can still be validated in a per-chunk basis. To enable
this, the caller needs to set the template parameter `check_digest` to
true, and provide the expected digest.
The data source calculates the digest incrementally through multiple
get() calls and compares against the expected digest after reading the
whole file range. If there is a mismatch, it throws an exception.
Checking the digest requires reading the whole data file. If this cannot
be satisfied (e.g., due to partial read or skip()), the data source
fails immediately. If the user has successfully read the whole file
range, it can be safely assumed that the digest is valid.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Any release < 6.0 or < 2023.1 is EOL and need not be supported by
scylla-gdb.py anymore. Remove compatibility code for these releases.
Closesscylladb/scylladb#20918
previously change, implementation was unnecessarily verbose and less
efficient, as it created and immediately discarded temporary strings.
remove unnecessary use of `fmt::to_string()` when arguments are already
being formatted by `seastar::format()`.
in this this change:
- eliminates creation of temporary `std::string` instances
- reduces memory allocations and copies
- improves performance
- simplifies the code
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20923
This is optimization.
Example:
block0: start=aaa, end=aaA
block1: start=bbb, end=bbB
block2: whatever
Before the patch, advance_to("aAA") would skip to block0, and upper
bound probe would skip to block1. This way, the reader would read the
range of block0 from the data file.
After the patch, "end" position is taken into account, so
advance_to("aAA") will notice that block0 doesn't contain the position
and will skip to block1. This is especially important for dense
indexes, as it allows us to skip accessing data file if the search key
is missing.
It also solves the edge case problem related to the fact that single
row reads are using a range which with positions which are not equal
to the key, but are before(key) and after(key) for the lower bound and
upper bound respectively. Before the patch, advance_to(before("bbb"))
would skip to block0, before the position is before the block1's
start. And upper bound probe for after("bbb") would point to
block2. This way the read would scan block0 needlessly. After the
patch, advance_to(before("bbb")) will skip to block1 because we notice
based on "end" that block0 doesn't contain the position.
This change also ensures that the start position of the upper bound
entry of the after_key(pos), where pos is the last advance_to()
position, is warm in cache. This is needed to optimize single-row
reads with a dense index so that they always read exactly one promoted
index block. For this to work, probe_upper_bound() for the
after_key(row) always needs to find the upper bound block in
cache.
It was unnecessary to emit a skip info for the first block since it
follows immediately the partition start, but it is relevant to the
optimization of avoiding data reads for missing keys. This
optimization relies on the fact that lower bound position equals upper
bound position. If the reader's key is before the first key in the
partition and we don't arm the skip info for the first block, lower
bound would be equal to the partition start, and upper bound would be
equal to the first row's position, which are not equal.
Currently, it may happen that the last promoted index block includes
the partition_end marker. That's because we first write the partition
end marker and then emit the unclosed block. This behavior matches
Cassandra (checked in 3.x and 5.0.1).
This is problematic for ruling out data file reads based on index.
The width field is currently unused, but it will be used later where
the width of the last block is used to compute the skip position past
the last block for lookups which land after all keys in the
partition. If width includes the marker then such a skip would land in
the next partition, which is incorrect, as the reader context expects
a cell element. Even if that was recognized, it's wrong - if this is
not a single partition read (so upper bound is not at the next
partition too), then we would read from the wrong (next) partition.
We want to be able to make such skips in order to avoid unnecessary
data file IO for reads of missing rows. Currently, we would always
read the last block even if the key is past its "end" position.
Another way to solve this would be to propagate the "past the last
block" condition from the index cursor to the reader and let it deal
with it, but the logic for that would be complicated. With this fix,
there is no special logic required.
This timeout was added to catch reader related deadlocks. We have not
seen such deadlocks for a long time, but we did see false-timeouts
caused by this, see explanation below. Since the cost now outweight the
benefit, remove the timeout altogether.
The false timeout happens during mixed-shard repair. The
`reader_permit::set_timeout()` call is called on the top-level permit
which repair has a handle on. In the case of the mixed-shard repair,
this belongs to the multishard reader. Calling set_timeout() on the
multishard reader has no effect on the actual shard readers, except in
one case: when the shard reader is created, it inherits the multishard
reader's current timeout. As the shard reader can be alive for a long
time, this timeout is not refreshed and ultimately causes a timeout and
fails the repair.
Refs: #18269Closesscylladb/scylladb#20703
- As part of deprecation of IP address usage, warning messages were added when IP addresses specified in the `ignore-dead-nodes` and `--ignore-dead-nodes-for-replace` options for scylla and nodetool.
- Slight optimizations for `utils::split_comma_separated_list`, ` host_id_or_endpoint lists` and `storage_service` remove node operations, replacing `std::list` usage with `std::vector`.
Fixesscylladb/scylladb#19218
Backport: 6.2 as it's not yet released.
Closesscylladb/scylladb#20756
* github.com:scylladb/scylladb:
config: Add a warning about use of IP address for join topology and replace operations.
nodetool: Add IP address usage warning for 'ignore-dead-nodes'.
tests: Fix incorrect UUIDs in test_nodeops
utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists
Explicitly disable tablets for features which still dont' work with tablets: cdc, lwt, coutners.
Closesscylladb/scylladb#20858
* github.com:scylladb/scylladb:
test: make cdc tests pass with tablets on by default
test: make cql/counters* pass with and without tablets
test: make cql/lwt_* pass with and without tablets
test: rename cql/list_test to cql/lwt_list_test
when building scylla with the standard library from GCC-14.2, shipped by
fedora 41, we have following build failure:
```
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=x86-64-v3 -mpclmul -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -MD -MT CMakeFiles/scylla-main.dir/Debug/init.cc.o -MF CMakeFiles/scylla-main.dir/Debug/init.cc.o.d -o CMakeFiles/scylla-main.dir/Debug/init.cc.o -c /home/kefu/dev/scylladb/init.cc
In file included from /home/kefu/dev/scylladb/init.cc:12:
In file included from /home/kefu/dev/scylladb/db/config.hh:20:
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:26:
/home/kefu/dev/scylladb/locator/tablets.hh:410:30: error: unexpected type name 'size_t': expected expression
410 | return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
| ^
/home/kefu/dev/scylladb/locator/tablets.hh:410:23: error: no member named 'irange' in namespace 'boost'
410 | return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
| ~~~~~~~^
/home/kefu/dev/scylladb/locator/tablets.hh:410:38: error: left operand of comma operator has no effect [-Werror,-Wunused-value]
410 | return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
| ^
3 errors generated.
[16/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/keys.cc.o
[17/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/counters.cc.o
[18/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/partition_slice_builder.cc.o
[19/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o
FAILED: CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=x86-64-v3 -mpclmul -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -MD -MT CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o -MF CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o.d -o CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o -c /home/kefu/dev/scylladb/mutation_query.cc
In file included from /home/kefu/dev/scylladb/mutation_query.cc:12:
In file included from /home/kefu/dev/scylladb/schema/schema_registry.hh:17:
In file included from /home/kefu/dev/scylladb/replica/database.hh:11:
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:26:
/home/kefu/dev/scylladb/locator/tablets.hh:410:30: error: unexpected type name 'size_t': expected expression
410 | return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
| ^
/home/kefu/dev/scylladb/locator/tablets.hh:410:23: error: no member named 'irange' in namespace 'boost'
410 | return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
| ~~~~~~~^
/home/kefu/dev/scylladb/locator/tablets.hh:410:38: error: left operand of comma operator has no effect [-Werror,-Wunused-value]
410 | return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
| ^
In file included from /home/kefu/dev/scylladb/mutation_query.cc:12:
In file included from /home/kefu/dev/scylladb/schema/schema_registry.hh:17:
In file included from /home/kefu/dev/scylladb/replica/database.hh:37:
In file included from /home/kefu/dev/scylladb/db/snapshot-ctl.hh:20:
/home/kefu/dev/scylladb/tasks/task_manager.hh:403:54: error: no member named 'irange' in namespace 'boost'
403 | co_await coroutine::parallel_for_each(boost::irange(0u, smp::count), [&tm, id, &res, &func] (unsigned shard) -> future<> {
| ~~~~~~~^
4 errors generated.
```
so let's take the opportunity to switch from `boost::irange` to
`std::views::iota`.
in this change, we:
- switch from boost::irange to std::views::iota for better standard library compatibility
- retain boost::irange where step parameter is used, as std::views::iota doesn't support it
- this change partially modernizes our range usage while maintaining
- existing functionality
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20924
When load_and_stream() completes, all sstables that were loaded (and
streamed) are unlinked. This is wrong for the restore-from-s3 task, as
removing objects from backup storage is not what user expects.
Fix it by adding a boolean to streamer class, and set it to false (well,
bool_class<>::no) for restore task.
fixes: #20938
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This boolean is currently passed all the way around as pure bool
argument. And it's only needed in a single get_endpoints() method that
calculates the target endpoints.
This patch places this bool on class streamer, so that the call chain
arguments are not polluted, and converts it to bool_class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The "real" registry defines its primary key as (location, generation)
pair, where location is the partition key and generation is clustering
key. The registry mock uses only location part as primary key, while it
must use both.
The buggy mock works simply because the listing API is in fact not used
by unit tests. Those tests that do need it are python tests that start
scylla and thus implicitly use real registry.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When sstables registry is listed, the passed consumer accepts entry
status as its first argument, not its location (location is passed as a
search key)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
in a3db5401, we introduced the TLS certi authenticator, which is
configured using `auth_certificate_role_queries` option . the
value of this option contains a regular expression. so there are
chances the regular expression is malformatted. in that case,
when converting its value presenting the regular expression to an
instance of `boost::regex`, Boost.Regex throws a `boost::regex_error`
exception, not `std::regex_error`.
since we decided to use Boost.Regex, let's catch `boost::regex_error`.
Refs a3db5401Fixesscylladb/scylladb#20941
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20942
Before these changes, we could create a materialized
view specifying its ID, but the option was ignored.
This commit makes Scylla respect the option. Now specifying
the ID results in the MV being created with that specific ID.
This way, Scylla's behavior is consistent with Cassandra's.
Because Cassandra doesn't mention the option in its
user documentation, we don't update it either in case
the semantics of it changes in the future -- we want
to have an open door for any modifications.
Note that Cassandra returns a server error if the provided
ID is already in use, both in the case of regular tables
and MVs. That's most likely a bug. Instead of following that
behavior, we stay consistent with the current semantics of
creating a regular table in Scylla: if the provided ID is
already used, return an InvalidRequest.
The last thing worth pointing out is Cassandra handles
`WITH ID = null` as a special case; normally, specifying
an invalid ID results in a ConfigurationException, but a null
is treated as a syntax error. As in the previous paragraph,
we stay consistent with the semantics of regular tables and
all invalid IDs, null included, lead to a ConfigurationException.
We also add a few short tests verifying that the implementation
works as intended.
Fixesscylladb/scylladb#20616
Backport not needed: the semantics of the option was never documented in either Cassandra, or Scylla.
Closesscylladb/scylladb#20773
* github.com:scylladb/scylladb:
test/cql-pytest: Get rid of unnecessary processing describe statements
cql3: Make creating MV respect ID option
In order to avoid cases during tablet migrations where we garbage
collect tombstones before the data it shadows arrives, we will
disable tombstone GC on pending replicas.
To achieve this we added a tombston_gc_enabled flag to compaction_group.
This flag is updated from updte_effective_repliction_map method of the
tablet_storage_group_manager class.
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes#20044.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This change adds the flag tombstone_gc_enabled to compaction_group.
The value of this flag will be set in
tablet_storage_group_manager::update_effective_replication_map().
Retry wasn't really happening since the loop was broken and sleep
part was skipped on error. Also, we were treating abort of split
during shutdown as if it were an actual error and that confused
longevity tests that parse for logs with error level. The fix is
about demoting the level of logs when we know the exception comes
from shutdown.
Fixes#20890.
This test is actually testing lists with LWT, so should
have the corresponding name. Going forward we'll patch CQL LWT
tests for tablets, so let's group them together.
operations.
When the '--ignore-dead-nodes-for-replace' config option contains
IP addresses, a warning will be logged, notifying the user that
using IP addresses with this option is deprecated and will no
longer be supported in the next release.
Fixesscylladb/scylladb#19218
Since we are deprecating the use of IP addresses, a warning message will be printed
if 'nodetool removenode --ignore-dead-nodes' is used with IP addresses.
- utils::split_comma_separated_list now accepts a reference to sstring instead
of a copy to avoid extra memory allocations. Additionally, the results of
trimming are moved to the resulting vector instead of being copied.
- service/storage_service removenode, raft_removenode, find_raft_nodes_from_hoeps,
parse_node_list and api/storage_service::set_storage_service were changed to use
std::vector<host_id_or_endpoint> instead of std::list<host_id_or_endpoint> as
std::vector is a more cache-friendly structure, resulting in better performance.
CI started reporting warnings about including `bytes.hh` in
several files. The reason is they actually only use code
introduced in `bytes_fwd.hh` (which is also included by `bytes.hh`).
Clang-include-cleaner suggests that we get rid of that indirection
and only include `bytes_fwd.hh`. That's what happens in this commit.
We include `bytes.hh` in `exceptions/exceptions.cc` because
it relies on the formatting utilities declared and defined
in `bytes.hh`.
Closesscylladb/scylladb#20842
As part of scylladb/scylladb@d42f160, we added a test verifying that
restoring the schema works as intended. Unfortunately, because of
scylladb/scylladb#20616, we had to manually process the results
of `DESCRIBE SCHEMA` to exclude the ID parameter and be able to
compare restore statements corresponding to the same view.
Now that materialized views respect the ID parameter, we can get rid
of that logic.
Before these changes, we could create a materialized
view specifying its ID, but the option was ignored.
This commit makes Scylla respect the option. Now specifying
the ID results in the MV being created with that specific ID.
This way, Scylla's behavior is consistent with Cassandra's.
Because Cassandra doesn't mention the option in its
user documentation, we don't update it either in case
the semantics of it changes in the future -- we want
to have an open door for any modifications.
Note that Cassandra returns a server error if the provided
ID is already in use, both in the case of regular tables
and MVs. That's most likely a bug. Instead of following that
behavior, we stay consistent with the current semantics of
creating a regular table in Scylla: if the provided ID is
already used, return an InvalidRequest.
The last thing worth pointing out is Cassandra handles
`WITH ID = null` as a special case; normally, specifying
an invalid ID results in a ConfigurationException, but a null
is treated as a syntax error. As in the previous paragraph,
we stay consistent with the semantics of regular tables and
all invalid IDs, null included, lead to a ConfigurationException.
We also add a few short tests verifying that the implementation
works as intended.
Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to reduce selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.
There is already infrastructure to lookup end position based on upper
bound of the read, but it's not effective becasue it's a
non-populating lookup and the upper bound cursor has its own private
cached_promoted_index, which is cold when positions are computed. It's
non-populating on purpose, to avoid extra index file IO to read upper
bound. In case upper bound is far-enough from the lower bound, this
will only increase the cost of the read.
The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.
We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately. This is especially important
for single-row reads, where the bounds are around the same key. In
this case we want to read the data file range which belongs to a
single promoted index block. It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache. Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.
Fixes#10030.
The change was tested with perf-fast-forward.
I populated the data set with `column_index_size_in_kb` set to 1
scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1
Test run:
build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0
This test reads two rows from the middle of a large partition (1M
rows), of subsequent keys. The first read will miss in the index file
page cache, the second read will hit.
Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.
Also, the first read which misses in cache is still faster after the change.
Before:
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4%
500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0%
After:
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1%
500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1%
(cherry picked from commit dfb339376aff1ed961b26c4759b1604f7df35e54)
Will be needed by the reader to jump to the current block even if we
already advanced to it before, when setting up the reader context.
We want to advance to lower bound earlier, before the praser skips to
the lower bound. We want that in order to set input stream data file
range based on index. If we didn't have access to the current block
and used the result from advance_to(), the parser will think we're
already in the block which has lower_bound when it attempts to skip,
and will not skip, falling back to scanning.
before this change, we enumerate the sstables tracked by the
system.sstables table, and restore them when serving
requests to "storage_service/restore" API. this works fine with
"storage_service/backup" API. but this "restore" API cannot be
used as a drop-in replacement of the rclone based API currently
used by scylla-manager.
in order to fill the gap, in this change:
* add the "prefix" parameter for specifying the shared prefix of
sstables
* add the "sstables" parameter for specifying the list of TOC
components of sstables
* remove the "snapshot" parameter, as we don't encode the prefix
on scylla's end anymore.
* make the "table" parameter mandatory.
Fixesscylladb/scylladb#20461
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we always pass
`sstable_directory::_storage_opts` to `_manager.make_sstable()` in
`sstable_directory::load_sstable()`. but when loading from object
storage, we need to customize the storage_options on a per-sstable
basis. the way to address this is to allow the caller of
`sstable_directory::process_descriptor()` to pass a functor which
return the `storage_options` to be used when creating the sstable.
so, in this change, we update
- sstable_directory::load_sstable()
- sstable_directory::process_descriptor()
so that they accept another parameter to create the storage_options.
in the next commit we will pass a different functor for customizing
the storage_options on a per-sstable basis when loading sstables.
Refs scylladb/scylladb#20461
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, `expected_request` only includes query strings
for the parameters of requests. but we will add an API
("storage_service/restore") which accepts its parameters in HTTP body
as well.
in this change, we add an optional `body` member to `expected_request`,
so that we can mock the APIs which pass the parameters with the HTTP
body.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we always send the parameters with query strings,
but we will add an API ("storage_service/restore") which accepts its
parameters in HTTP body as well.
in this change, we add an optional parameter to `do_request()` and
`post()`, so that we can send HTTP body when using "POST" method
in nodetool implementation.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Though database can be used to get relevant token metadata, it's better
not to use one service (database) as a proxy to get another one (token
metadata). In case of tokens, there's effective replication map at hand,
which is a more correct source of such topology information.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20894
Maintainers use scripts/pull_github_pr.sh from scylladb.git when merging PRs and before pushing to the next. We want to prevent merges from piling up on top of unstable builds. This change will check Gating's current status and notify the maintainers
Related to scylladb/scylla-pkg#3644Closesscylladb/scylladb#20742
The transport/controller.cc bypasses seastar API when making a few syscalls,
this PR will use the right seastar API to make the syscall and libc calls
this PR relies on few new APIs introduced in
seastar commit : cd7f3b8e8850cd80a4f6899cedc726e576c51abe
Closesscylladb/scylladb#17443Closesscylladb/scylladb#19565
Using the standard library is preffered over boost.
In cql3/expr/expression.cc to_sorted_vector got more of a
face-list and was modernized to use also std::unique
and while at it, to move its input range in the uniquely sorted
result vector.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
'static inline' is always wrong in headers - if the same header is
included multiple times, and the function happens not to be inlined,
then multiple copies of it will be generated.
Fix by mechanically changing '^static inline' to 'inline'.
Use database::get_snapshot_details to get the details
of all snapshots on disk, in particular those of
deleted tables.
Add test_snapshots_dropped_table to test listing
of snapshots of a deleted table.
And harden the existing test cases to use a unique
snapshot tag and to delete it when the test ends.
Fixes#18313
* No backport required at this time since this is rather minor UX issue that weren't hit in the field AFAIK
Closesscylladb/scylladb#20869
* github.com:scylladb/scylladb:
cql-pytest: test_virtual_tables: add test_snapshots_multiple_keyspaces
virtual_tables: snapshots: include all snapshots
There are two bits that control whenter replication strategy for a
keyspace will use tablets or not -- the configuration option and CQL
parameter. This patch tunes its parsing to implement the logic shown
below:
if (strategy.supports_tablets) {
if (cql.with_tablets) {
if (cfg.enable_tablets) {
return create_keyspace_with_tablets();
} else {
throw "tablets are not enabled";
}
} else if (cql.with_tablets = off) {
return create_keyspace_without_tablets();
} else { // cql.with_tablets is not specified
if (cfg.enable_tablets) {
return create_keyspace_with_tablets();
} else {
return create_keyspace_without_tablets();
}
}
} else { // strategy doesn't support tablets
if (cql.with_tablets == on) {
throw "invalid cql parameter";
} else if (cql.with_tablets == off) {
return create_keyspace_without_tablets();
} else { // cql.with_tablets is not specified
return create_keyspace_without_tablets();
}
}
closes: #20088
In order to enable tablets "by default" for NetworkTopologyStrategy
there's explicit check near ks_prop_defs::get_initial_tablets(), that's
not very nice. It needs more care to fix it, e.g. provide feature
service reference to abstract_replication_strategy constructor. But
since ks_prop_defs code already highjacks options specifically for that
strategy type (see prepare_options() helper), it's OK for now.
There's also #20768 misbehavior that's preserved in this patch, but
should be fixed eventually as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20779
Fix two recent regressions of the cmake build -- found this time in the test suite.
We (presumably) don't build stable releases (and their tests) with CMake, so backporting these fixes appears unnecessary, even if the regressions have been ported to stable branches.
@xemul @dawmd @tchaikov @tgrabiec @scylladb/scylla-maint
Closesscylladb/scylladb#20854
* github.com:scylladb/scylladb:
test/boost/bptree_test: fix the CMake build
test/boost/auth_test: fix the CMake build
before this change, clang-tidy is triggered by a pull request. but
there are chances that user wants to retrigger it. for jenkins
jobs, user can rebuild a job manually. but for workflow, only the
developers with write permission can retrigger a workflow. this
is not convenient to regular contributors.
so, in this change, another trigger is added, so that user can
trigger the clang-tidy workflow with "/clang-tidy" command.
the syntax is inspired by IRC commands.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20841
A primitive python http server is processing s3 client requests and issues either success or error. A multipart uploader should fail or succeed (with or without retries) depending on aforementioned server response
Handle the `finalize_upload` possible exception to abort the upload (which also can throw) and show the right error originated from the `finalize_upload`
Instead of ignoring the response for multipart upload completion start parsing it and look for a possible errors in the response body. If the error is found throw an exception
This patch removes a piece of code which, according to the comment,
allows for forwarding the index reader even if it was created as a
single-partition reader.
For single-partition reads, the input_stream
used by the reader is limited to the single index page containing the partition,
since reading the index file past that point would be a waste.
Because of this limit, such an index reader can't be forwarded/advanced.
The dubious piece of code gets around that by unsetting the stream and ensuring
it will be re-created, this time without the limit, if the index is advanced.
But there is no use for this. The idea of a "single-partition reader"
exist as an optimization. It's illegal to forward single-partition readers,
and it doesn't make sense to attempt that. (If there's a need for forwarding,
just don't create a single-partition reader).
I suspect this piece of code was written due to a misunderstanding.
Before the previous patch in this series, when the searched partition key
was the first key in its page, the index reader would scan the preceding
page first, realize it made a mistake, and advance to the next, correct page.
I suspect this piece of code was written to make this work.
But this is, in fact, undesirable. The fact that the index reader was working
like this was a performance bug. In the single-partition case there's
never an inherent reason to start with the wrong page. The index logic can be
corrected to always start with the right page, and that's what the previous
patch in this series does. And with that, there is no need to support advancing
anymore, and the dubious piece of code can be erased.
We also add an assert to emphasize that advancing a single-partition reader is
illegal.
When looking for a partition key in the index, we scan the index from the
first index page which can possibly contain the key.
In a single-partition read, there is never a reason to read beyond that
page. After the previous patch in this series, it's guaranteed that
the first key in the next page is strictly greater than the searched key.
So if the searched key is greater than the last key in the first page,
then it is neither in the first nor the second page -- it must be absent
from the sstable.
But with the current logic, we read the second index page anyway, and
the realization that the key is absent happens higher in the call chain.
This patch optimizes that inefficiency by immediately returning EOF
if a single-partition read doesn't find the key in the first page.
Returning "end of file" even though we didn't actually go beyond the
end of file is hacky, but I don't see any other non-invasive way
of communicating to the caller that the partition is absent.
Some caller of the index could possibly assume that returning EOF
proves that the searched key is greater than all keys in the sstable.
I don't think any such caller exists today, but it's a possible
place for confusion.
Together with the previous patch in this series, this patch guarantees
that a single-partition read only accesses a single index page.
This fixes a weird secondary performance bug.
Due to some misunderstanding in the logic, when during a single-partition read
we scan two index pages, the second index page is scanned via an input_stream
created without an upper I/O limit, which means that we additionally read
a full read-ahead (currently: 64 kiB) past the second index page for no reason
whatsoever.
After this and the previous patch, a single-partition read always reads exactly one
index page, so the above problem cannot occur.
When setting the index to position X, we first look for the first summary
entry N such that N >= X. Then we load the index page preceding N and
scan it for the first partition key P such that P >= X.
If there is no such key in this page, then we scan the next page
(starting with N) for such key. (In this case it's always the first key).
For example, assume we have:
summary: A C E
index: A B C D E F
If we look up "B" in the index, then we first locate summary entry "C",
then we scan the index for B, starting from "A". This is all fine.
But when we look for "C" in the index, then we do the exactly the same
-- we scan the index for "C" starting from "A". This is wasteful, because
we can start scanning from "C".
To avoid this inefficiency, we should be looking for N > X, not N >= X.
This patch fixes that.
In addition, this fixes a second, weirder performance bug.
Due to some misunderstanding in the logic, when during a single-partition read
we scan two index pages, the second index page is scanned via an input_stream
created without an upper I/O limit, which means that we additionally read
a full read-ahead (currently: 64 kiB) past the second index page for no reason
whatsoever. After this patch, a single-partition read always reads exactly one
index page, so the above problem cannot occur.
This fixes a use-after-free bug when parsing clustering key across
pages.
Also includes a fix for allocating section retry, which is potentially not safe (not in practice yet).
Details of the first problem:
Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.
To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.
In 93482439, the promoted index cursor was optimized to avoid
fully page copy when parsing index blocks. Instead, parser is
given a temporary_buffer which is a view on the page.
A bit earlier, in b1b5bda, the parser was changed to keep shared
fragments of the buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.
If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.
The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.
This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.
Details of the solution:
We adapt page_view to a temporary_buffer-like API. For this, a new concept
is introduced called ContiguousSharedBuffer. We also change parsers so that
they can be templated on the type of the buffer they work with (page_view vs
temporary_buffer). This way we don't introduce indirection to existing algorithms.
We use page_view instead of temporary_buffer in the promoted
index parser which works with page cache buffers. page_view can be safely
shared via share() and stored across allocating sections. It keeps hold to the
LSA buffer even across allocating sections by the means of cached_file::page_ptr.
Fixes#20766Closesscylladb/scylladb#20837
* github.com:scylladb/scylladb:
sstables: bsearch_clustered_cursor: Add trace-level logging
sstables: bsearch_clustered_cursor: Move definitions out of line
test, sstables: Verify parsing stability when allocating section is retried
test, sstables: Verify parsing stability when buffers cross page boundary
sstables: bsearch_clustered_cursor: Switch parsers to work with page_view
cached_file: Adapt page_view to ContiguousSharedBuffer
cached_file: Change meaning of page_view::_size to be relative to _offset rather than page start
sstables, utils: Allow parsers to work with different buffer types
sstables: promoted_index_block_parser: Make reset() always bring parser to initial state
sstables: bsearch_clustered_cursor: Switch read_block_offset() to use the read() method
sstables: bsearch_clustered_cursor: Fix parsing when allocating section is retried
Fixes#20862
With the change in 60af2f3cb2 the bookkeep
for buffer memory was changed subtly, the problem here that we would
shrink buffer size before we after flush use said buffer's size to
decrement the buffer_list_bytes value, previously inc:ed by the full,
allocated size. I.e. we would slowly grow this value instead of adjusting
properly to actual used bytes.
Test included.
Closesscylladb/scylladb#20886
During migration cleanup, there's a small window in which the storage
group was stopped but not yet removed from the list. So concurrent
operations traversing the list could work with stopped groups.
During a test which emitted schema changes during migrations,
a failure happened when updating the compaction strategy of a table,
but since the group was stopped, the compaction manager was unable
to find the state for that group.
In order to fix it, we'll skip stopped groups when traversing the
list since they're unused at this stage of migration and going away
soon.
Fixes#20699.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#20798
Removes the update command from the setup command.
This is required because versions now are not strictly pinned in the poetry.lock file since Sphinx ScyllaDB Theme 1.8.
Closesscylladb/scylladb#20876
To reduce the amount of space needed for reports, this PR will modify logs
attachment in allure, so it will attach logs only for the tests that have
status other than PASSED. To simplify the solution, with the current way it's
not possible to switch off these logs completely.
Closesscylladb/scylladb#20786
following headers are no longer used by this compilation unit:
- "utils/managed_ref.hh"
- "test/perf/perf.hh"
this was identified by clang-include-cleaner. As the code is audited,
we can safely remove the #include directive.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20850
since Python 3.13, passing count to `re.sub()` as positional argument
has been deprecated. and when runnint `test.py` with Python 3.13, we
have following warning:
```
/home/kefu/dev/scylladb/./test.py:1477: DeprecationWarning: 'count' is passed as positional argument
args.tests = set(re.sub(r'.* List configured unit tests\n(.*)\n', r'\1', out, 1, re.DOTALL).split("\n"))
```
see also https://github.com/python/cpython/issues/56166
in order to silence this distracting warning, let's pass
`count` using kwarg.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20859
Currently, node ops virtual task gathers its children from all nodes contained
in a sum of service::topology::normal_nodes and service::topology::transition_nodes.
The maps may contain nodes that are down but weren't removed yet. So, if a user
requests the status of a node ops virtual task, the task's attempt to retrieve
its children list may fail with seastar::rpc::closed_error.
Filter out the tasks that are down in node_ops::task_manager_module::get_nodes.
Fixes: #20843.
Closesscylladb/scylladb#20856
clang-tidy warns:
```
Warning: /__w/scylladb/scylladb/utils/directories.cc:132:52: warning: 'path' used after it was moved [bugprone-use-after-move]
132 | bool can_access = co_await file_accessible(path.string(), access_flags::read | access_flags::write | access_flags::execute);
| ^
/__w/scylladb/scylladb/utils/directories.cc:121:28: note: move occurred here
121 | verification_error(std::move(path), "File not owned by current euid: {}. Owner is: {}", geteuid(), sd.uid);
| ^
```
because we pass `std::move(path)` to `verification_error()`, and "then" use this variable again in this same function.
this is a false alarm, but we could make it very clear to convince this tool that it's safe to do so.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#20875
* github.com:scylladb/scylladb:
directories: mark verification_error() with [[noreturn]]
directories: pass const ref of path to verification_error()
Reduce compile time and unnecessary compilations by reducing #include load.
Minor refactoring, no backport.
Closesscylladb/scylladb#20864
* github.com:scylladb/scylladb:
raft_group0_client: uninclude "raft_group0_registry.hh"
raft_group_registry: extract raft_timeout
raft_group0_client: uninclude "mutation/mutation.hh"
raft_group0_client: uninclude "db/system_keyspace.hh"
db: system_keyspace: extract auth_version_t into its own header
this helps the compiler or static analyzers do make the right decision.
for instance, clang-tidy thinks a parameter like `std::move(path)`
could be reused after being moved away. with this attribute, this tool
should be able to tell that this never happens.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we pass a `path` to `verification_error()` by
moving away from the original `path`. this works fine in the sense
that it is correct and does not incur potential performance issues.
but clang-tidy considers it a used-after-move, because it cannot tell
`verification_error()` does not return at all, and believes that `path`
could be accessed again after being moved away. so it warns like:
```
Warning: /__w/scylladb/scylladb/utils/directories.cc:132:52: warning: 'path' used after it was moved [bugprone-use-after-move]
132 | bool can_access = co_await file_accessible(path.string(), access_flags::read | access_flags::write | access_flags::execute);
| ^
/__w/scylladb/scylladb/utils/directories.cc:121:28: note: move occurred here
121 | verification_error(std::move(path), "File not owned by current euid: {}. Owner is: {}", geteuid(), sd.uid);
| ^
```
in this change, instead of passing `fs::path` to `verification_error()`,
we pass a `const fs::path&` to this function. because
`verification_error()` is not coroutine, neither does it not pass `path` to
another continuation to be scheduled. so it's perfectly fine to pass
`path` to it.
this change address the false alarms from clang-tidy.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
While documenting materialized view in a new document (Refs #16569)
I encountered a few questions and this patch contains tests that
clarify their answer - and can later guarantee that the answer doesn't
unintentionally change in the future. The questions that these tests
answer are:
1. It is not allowed to filter a view on a static column (a comment
on the test explains why).
2. We already tested that it's not allowed to SELECT a static column into
a view. Here we add the check that "SELECT *" is also not allowed if
a static column exists in the base table.
3. We check that CREATE MATERIALIZED VIEW ... WITH COMMENT='..' works.
4. We check that CREATE MATERIALIZED VIEW ... WITH COMPACT STORAGE is
forbidden.
5. We check that CREATE MATERIALIZED VIEW ... WITH garbage=.. fails
with a clean InvalidRequest.
All these tests pass on both Scylla and Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20873
One of the design goals of the test/cql-pytest frameworks was to be able
to run these tests against Cassandra. Preferably, we should be able to
run most of the tests against any popular version of Cassandra, including
Cassandra 3. This is admittingly a very old version, but was still maintained
until just a year ago, it's the version that Scylla is most compatible with,
and we can still be curious about how it worked.
Until recently cql-pytest indeed worked on Cassandra 3, but it broke on some
change related to tablet detection that cause our most basic fixture -
"text_keyspace" - to use the Cassandra 4 feature of "auto expand".
This is trivial to fix - we should just use the this_dc fixture that we already
had exactly for this purpose.
Fixes#20781
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20782
Use database::get_snapshot_details to get the details
of all snapshots on disk, in particular those of
deleted tables.
Add test_snapshots_dropped_table to test listing
of snapshots of a deleted table.
And harden the existing test cases to use a unique
snapshot tag and to delete it when the test ends.
Fixesscylladb/scylladb#18313
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
* ./seastar 69f88e2f...3c9c2696 (14):
> core/reactor: don't check AIO block count when they are not needed
> build: do not print the default value of --c++-standard in help output
> json_formatter: Add tests for formatter::write
> Add APIs to get group details and to change ownership of file.
> scripts/perftune.py: improve a dry-run printout
> build: drop the workaround for a GCC bug
> cmake: Depend on libbsd if DPDK depends on it
> http: clarify the ownership in the router's doxygen comment
> build: check for P2582R1 support
> python: introduce a python formatting CI check
> addr2line: reformat with black
> scripts: add pyproject.toml
> json_formatter: Make formatter::write work for std::pair
> README.md: use the github homepage of Ceph for Crimson
Closesscylladb/scylladb#20836
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump because the service only have write acess with systemd_coredump_var_lib_t. To make it writable, we need to add file context rule for /var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.
Fixes#19325Closesscylladb/scylladb#20528
* github.com:scylladb/scylladb:
scylla_raid_setup: configure SELinux file context
scylla_coredump_setup: fix SELinux configuration for RHEL9
It doesn't need it apart from a forward declaration.
Files that lost necessary includes are adjusted, and some users
of auth_version_t are redirected to the definition outside system_keyspace.
When `property_file` is provided, we generate a
`cassandra-rackdc.properties` file, but to actually use it,
`endpoint_snitch` must be set to `GossipingPropertyFileSnitch`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#20730
In order to later use the formatter for the inner class
promoted_index_block, which is defined out of line after
cached_promoted_index class definition.
This fixes a use-after-free bug when parsing clustering key across
pages.
Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.
To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.
In b1b5bda, the parser was changed to keep shared fragments of the
buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.
If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.
The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.
This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.
The solution is to use page_view instead of temporary_buffer, which
can be safely shared via share() and stored across allocating
section. The page_view maintains its hold to the LSA buffer even
across allocating sections.
Fixes#20766
Currently, parsers work with temporary_buffer<char>. This is unsafe
when invoked by bsearch_clustered_cursor, which reuses some of the
parsers, and passes temporary_buffer<char> which is a view onto LSA
buffer which comes from the index file page cache. This view is stable
only around consume(). If parsing requires more than one page, it will
continue with a different input buffer. The old buffer will be
invalid, and it's unsafe for the parser to store and access
it. Unfortunetly, the temporary_buffer API allows sharing the buffer
via the share() method, which shares the underlying memory area. This
is not correct when the underlying is managed by LSA, because storage
may move. Parser uses this sharing when parsing blobs, e.g. clustering
key components. When parsing resumes in the next page, parser will try
to access the stored shared buffers pointing to the previous page,
which may result in use-after-free on the memory area.
In prearation for fixing the problem, parametrize parsers to work with
different kinds of buffers. This will allow us to instantiate them
with a buffer kind which supports sharing of LSA buffers properly in a
safe way.
It's not purely mechanical work. Some parts of the parsing state
machine still works with temporary_buffer<char>, and allocate buffers
internally, when reading into linearized destination buffer. They used
to store this destination in _read_bytes vector, same field which is
used to store the shared buffers. Now it's not possible, since shared
buffer type may be different than temporary_buffer<char>. So those
paths were changed to use a new field: _read_bytes_buf.
When reset() is done due to allocating section retry, it can be
theoretically in an arbitrary point. So we should not assume that it
finished parsing and state was reset by previous parsing. We should
reset all the fields.
Collection stress tests include testing of B- B+- and radix trees, and those tests live in unit/ suite. There are also small corner-case tests for those collections in boost/ suite. There's an attempt to get rid of unit suite in favor of boost one, and this PR moves the collections stress testing from unit suite into their boost counterparts.
refs: scylladb/qa-tasks#1655Closesscylladb/scylladb#20475
* github.com:scylladb/scylladb:
test: Move other collection-testing headers from unit to boost
test: Move stress-collecton header from unit to boost
test: Move B+tree compactiont test from unit to boost
test: Move radix tree compactiont test from unit to boost
test: Move B-tree compactiont test from unit to boost
test: Move radix tree stress test from unit to boost
test: Move B-tree stress test from unit to boost
test: Move b+tree stress test from unit to boost
test: Add bool in_thread argument to stress_collection function
The datadir keeps path to directory where local sstables can be. The very same information is now kept in table's storage options (#20542). This set fixes the remaining places that still use table::config::datadir and table::dir() and removes the datadir field.
Closesscylladb/scylladb#20675
* github.com:scylladb/scylladb:
treewide: Remove table::config::datadir
distributed_loader: Print storage options, not datadir
data_dictionary: Add formatter for storage_options
test: Construct table_for_tests with table storage options
test: Generalize pair of make_table_for_tests helpers
tests: Add helper to get snapshot directory from storage options
table: snapshot_exists: Get directory from storage options
table: snapshot_on_all_shards: Get directory from storage options
For each new node added to the raft config populate its ID to IP mapping
in raft address map from the gossiper. The mapping may have expired if a
node is added to the raft configuration long after it first appears in
the gossiper.
Fixesscylladb/scylladb#20600
Backport to all supported versions since the bug may cause bootstrapping failure.
Closesscylladb/scylladb#20601
* github.com:scylladb/scylladb:
test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
group0: make sure that address map has an entry for each new node in the raft configuration
Parser's state was not reset when allocating section was retried.
This doesn't cause problems in practice, because reserves are enough
to cover allocation demands of parsing clustering keys, which are at
most 64K in size. But it's still potentially unsafe and needs fixing.
After commit d16ea0af, compiling the server using cmake fails with the
following error :
```
FAILED: service/CMakeFiles/service.dir/Dev/qos/service_level_controller.cc.o
...
/home/Scylla/scylladb/cql3/util.hh:21:10: fatal error: 'cql3/CqlParser.hpp' file not found
21 | #include "cql3/CqlParser.hpp"
| ^~~~~~~~~~~~~~~~~~~~
1 error generated.
```
Fix it by linking the cql3 to the service library.
Closesscylladb/scylladb#20805
Maintainers use scripts/pull_github_pr.sh from scylladb.git when merging PRs and before pushing to the next. We want to prevent merges from piling up on top of unstable builds. This change will check Gating's current status and notify the maintainers
Related to scylladb/scylla-pkg#3644Closesscylladb/scylladb#20742
When users create a table using the Alternator API, they can decide if the billing is PROVISIONED of PAY_PER_REQUEST.
If the billing is set to PROVISIONED, they need to set the ProvisionedThroughput ReadCapacityUnits (RCU) and WriteCapacityUnits (WCU).
This series adds support for getting and setting the ProvisionedThroughput. The values will be stored as table extension tags.
Following how TTL is stored within the Alternator, we will use ```system:rcu_attribute``` and ```system:wcu_attribute``` for the labels.
The series adds a test that sets ProvisionedThroughput and validates that it gets the value back. It was tested with both Alternator and AWS.
This series is part of the effort to monitor, limit, and bill Alternator operations.
New code, no need to backport.
Closesscylladb/scylladb#20056
* github.com:scylladb/scylladb:
docs/alternator/compatibility.md: explain the consumed capacity provisioned
Add test/alternator/test_provisioned_throughput.py
test/alternator/util.py: Allow override BillingMode
alternator/executor.cc: Store ProvisionedThroughput
A global index has a primary key of the form
(indexed_column, token, partition_key_column..., clustering_key_column...)
The primary key columns are used to point at the base table row, and
the token (computed as token(partition_key_column...) is used to maintain
sort order.
The query planner has an optimization: if the partition key is fully
constrained to a unique value, then we compute the token from the partition
key and use that to seek directly into the clustering row range for
that base table partition. If the clustering key is also partially
constrained, it is used to refine the index clustering key.
Currently, this optimization is implemented as a hack: the partition key
is extracted from the prepared statement + query options in
get_global_index_token_clustering_ranges(), then used to calculate
the token, which is then substituted in the expression passed to
get_single_column_clustering_bounds() (the expression is shared across
all running queries, so this is quite dangerous).
We simplify the whole thing:
- Let prepare_index_global() recognize that if the partition key is not
fully constrained, then there is no way that we'll be able to compute
the token (as it needs all partition key columns). Since the token
is the first clustering key column of the index table, we can truncate
it to length zero and bail out.
- Otherwise, the partition key is fully constrained. We refactor the
predicate (pk1 = :a AND pk2 = :b) to (pk1, pk2) := (:a, :b). We then
pass expressions representing the partition key to the token function,
ending up with token(:a, :b). We then substitute this expression into
(*_idx_tbl_ck_prefix)[0], which computes the first clustering key
column for the index table.
- Remove the runtime component in get_global_index_clustering_ranges().
Note this include the early return if the partition key wasn't fully
constrained (though the comment only mentions over-constraining), and
the token computation, which is now done by evaluate().
Closesscylladb/scylladb#20733
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.
Fixes: scylladb/scylladb#20629
Need to be backported since this is a regression
Closesscylladb/scylladb#20743
* github.com:scylladb/scylladb:
test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
topology coordinator:: mark node as being replaced earlier
topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
What it called "leader" is actually the destination of the RPC.
Trivial fix, should be backported to all affected versions.
Closesscylladb/scylladb#20789
before this change, `config_file::set_value()` and
`config_file::set_value_on_all_shards()` provide default value for
`config_source`. but the default value is never used -- we alway
specify the `source_source` when calling `set_value_on_all_shards()`.
so in hope to improve the readability, the default value is removed.
so, for example, one can figure out when `config_source::Internal` is
used with less efforts. despite that `config_file::set_value()` is not
used in the tree. for the sake of completeness, its default value is
also dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20728
with this parameter, "backup" API can backup the given table, this
enables it to be a drop-in replacement of existing rclone API used by
scylla manager.
Fixes https://github.com/scylladb/scylladb/issues/20636
---
this change is a part of the efforts to bring the native backup/restore to scylla, no need to backprt.
Closesscylladb/scylladb#20661
* github.com:scylladb/scylladb:
backup_task: fix the indent
treewide: add "table" parameter to "backup" API
We found that --clang-build-mode INSTALL_FROM tries to rebuild clang
even we use an archive of prebuilt image.
Seems like it is because ninja detected changes on standard library
headers, which updated when we build new frozen toolchain container
image.
To avoid such unnecessary rebuild, we should stop archive whole clang
build directory, we should archive install image instead.
To do so, we can use
"DESTDIR=<sysroot dir> ninja install-distribution-stripped", and archive
sysroot dir as clang archive.
Fixes#20421Closesscylladb/scylladb#20422
This commit modifies the Features page in the following way:
- It adds a short introduction and descriptions to each listed feature.
- It hides the ToC (required to control and modify the information on the page,
e.g., to add descriptions, have full control over what is displayed, etc.)
- Removes the info about Enterprise features (following the request not to include
Enterprise info in the OSS docs)
Fixes https://github.com/scylladb/scylladb/issues/20617
Blocks https://github.com/scylladb/scylla-enterprise/pull/4711Closesscylladb/scylladb#20635
Currently, node ops tasks type is retrieved from topology_request
without any change. Use respective node operation name instead.
Closesscylladb/scylladb#20671
The test performs consecutive schema changes in RECOVERY mode. The
second change relies on the first. However the driver might route the
changes to different servers and we don't have group 0 to guarantee
linearizability. We must rely on the first change coordinator to push
the schema mutations to other servers before returning, but that only
happens when it sees other servers as alive when doing the schema
change. It wasn't guaranteed in the test. Fix this.
Fixesscylladb/scylladb#20791
Should be backported to all branches containing this test to reduce
flakiness.
Closesscylladb/scylladb#20792
with this parameter, "backup" API can backup the given table, this
enables it to be a drop-in replacement of existing rclone API used by
scylla manager.
in this change:
* api/storage_service: add "table" parameter to "backup" API.
* snapshot_ctl: compose the full path of the snapshot directory in
`snapshot_ctl::start_backup`. since we have all the information
for composing the snapshot directory, and what the `backup_task_impl`
class is interested is but the snapshot directory, we just pass
the path to it instead the individual components of the directory.
* backup_task_impl: instead of scan the whole keyspace recursively,
only scan the specified snapshot directory.
Fixesscylladb/scylladb#20636
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Auth has been managed via Raft since Scylla 6.0. Restoring data
following the usual procedure (1) is error-prone and so a safer
method must have been designed and implemented. That's what
happens in this PR.
We want to extend `DESC SCHEMA` by auth and service levels
to provide a safe way to backup and restore those two components.
To realize that, we change the meaning of `DESC SCHEMA WITH INTERNALS`
and add a new "tier": `DESC SCHEMA WITH INTERNALS AND PASSWORDS`.
* `DESC SCHEMA` -- no change, i.e. the statement describes the current
schema items such as keyspaces, tables, views, UDTs, etc.
* `DESC SCHEMA WITH INTERNALS` -- does the same as the previous tier
and also describes auth and service levels. No information about
passwords is returned.
* `DESC SCHEMA WITH INTERNALS AND PASSWORDS` -- does the same
as the previous tier and also includes information about the salted
hashes corresponding to the passwords of roles.
To restore existing roles, we extend the `CREATE ROLE` statement
by allowing to use the option `WITH SALTED HASH = '[...]'`.
---
Implementation strategy:
* Add missing things/adjust existing ones that will be used later.
* Implement creating a role with salted hash.
* Add tests for creating a role with salted hash.
* Prepare for implementing describe functionality of auth and service levels.
* Implement describe functionality for elements of auth and service levels.
* Extend the grammar.
* Add tests for describe auth and service levels.
* Add/update documentation.
---
(1): https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/backup-restore/restore.html
In case the link stops working, restoring a schema was realised
by managing raw files on disk.
Fixesscylladb/scylladb#18750Fixesscylladb/scylladb#18751Fixesscylladb/scylladb#20711Closesscylladb/scylladb#20168
* github.com:scylladb/scylladb:
docs: Update user documentation for backup and restore
docs/dev: Add documentation for DESC SCHEMA
test: Add tests for describing auth and service levels
cql3/functions/user_function: Remove newline character before and after UDF body
cql3: Implement DESCRIBE SCHEMA WITH INTERNALS AND PASSWORDS
auth: Implement describing auth
auth/authenticator: Add member functions for querying password hash
service/qos/service_level_controller: Describe service levels
data_dictionary: Remove keyspace_element.hh
treewide: Start using new overloads of describe
treewide: Fix indentation in describe functions
treewide: Return create statement optionally in describe functions
treewide: Add new describe overloads to implementations of data_dictionary::keyspace_element
treewide: Start using schema::ks_name() instead of schema::keyspace_name()
cql3: Refactor `description`
cql3: Move description to dedicated files
test: Add tests for `CREATE ROLE WITH SALTED HASH`
cql3/statements: Restrict CREATE ROLE WITH SALTED HASH
auth: Allow for creating roles with SALTED HASH
types: Introduce a function `cql3_type_name_without_frozen()`
cql3/util: Accept std::string_view rather than const sstring&
To fix a race between split and repair here c1de4859d8, a new sstable
generated during streaming can be split before being attached to the sstable
set. That's to prevent an unsplit sstable from reaching the set after the
tablet map is resized.
So we can think this split is an extension of the sstable writer. A failure
during split means the new sstable won't be added. Also, the duration of split
is also adding to the time erm is held. For example, repair writer will only
release its erm once the split sstable is added into the set.
This single-sstable split is going through run_custom_job(), which serializes
with other maintenance tasks. That was a terrible decision, since the split may
have to wait for ongoing maintenance task to finish, which means holding erm
for longer. Additionally, if split monitor decides to run split on the entire
compaction group, it can cause single-sstable split to be aborted since the
former wants to select all sstables, propagating a failure to the streaming
writer.
That results in new sstable being leaked and may cause problems on restart,
since the underlying tablet may have moved elsewhere or multiple splits may
have happened. We have some fragility today in cleaning up leaked sstables on
streaming failure, but this single-sstable split made it worse since the
failure can happen during normal operation, when there's e.g. no I/O error.
It makes sense to kill run_custom_job() usage, since the single-sstable split
is offline and an extension of sstable writing, therefore it makes no sense to
serialize with maintenance tasks. It must also inherit the sched group of the
process writing the new sstable. The inheritance happens today, but is fragile.
Fixes#20626.
Closesscylladb/scylladb#20737
* github.com:scylladb/scylladb:
tablet: Fix single-sstable split when attaching new unsplit sstables
replica: Fix tablet split execute after restart
In the current scenario, We check if a node being removed is normal
on the node initiating the removenode request. However, we don't have a
similar check on the topology coordinator. The node being removed could be
normal when we initiate the request, but it doesn't have to be normal when
the topology coordinator starts handling the request.
For example, the topology coordinator could have removed this node while handling
another removenode request that was added to the request queue earlier.
This commit intends to fix this issue by adding more checks in the enqueuing phase
and return errors for duplicate requests for node removal.
This PR fixes a bug. Hence we need to backport it.
Fixes: scylladb/scylladb#20271Closesscylladb/scylladb#20500
We update the relevant articles addressing backing-up
and restoring the schema by specifying that the user
performing it must be a superuser. We also update
the required version of cqlsh.
Additionally, we add an article covering the fundamental
information on `DESCRIBE SCHEMA`.
We add documentation for developers addressing
`DESCRIBE SCHEMA`. It covers the following aspects
of it:
* motivation,
* synopsis of the solution,
* implementation of the solution,
as well as a few subsections explaining the details:
* restoring process and its side effects,
* restoring roles with passwords,
* list of statements generated by `DESC SCHEMA`
with examples,
* implementation details.
We add tests verifying the following features work correctly:
* describing auth: roles, role grants, granting permissions on
resources,
* describing service levels: creating them and attaching to roles.
We remove newline characters that are printed before and after
a UDF's body. This way, we want to keep the create statement
as close to what was actually provided as possible. Although
there should be no semantic differences with or without the
newline characters, it's a lot more convenient in testing when
they're not present.
Fixesscylladb/scylladb#20711
When executing `DESC SCHEMA WITH INTERNALS`, Scylla now also returns
statements that can be used to recreate service levels and restore
the state of auth. That encompasses granting roles and permissions
as well as attaching service levels to roles.
If the additional parameter `WITH PASSWORDS` is provided,
the statements corresponding to recreating roles in the system
will also contain the stored salted hashes.
We introduce a function `describe_auth()` in `auth::service`
responsible for producing a sequence of descriptions whose
corresponding CQL statement can be used to restore the state
of auth.
Add a README.md in test/boost, giving a short introduction to what this
directory is and what kind of tests it contains, and how to run individual
tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20550
Now all its users are in boost suite. Once moved, the
stress_collection() function no longer runs in seastar thread, and the
in_thread argument is removed while the function is moved.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This time the boost test needs to stop being pure-boost test, since
bptree compaction test case needs to run in seastar thread. Other
collection tests are already such, not bptree_test joins the party.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This test must run in seastar thread, so put it in seastar-thread test
case, fortunately btree test allows that. Just like its stress peer,
this test also has two invocations from suite, so make it two distinct
test cases as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This also moves the code, but takes into account the stress test had two
invovations with suite options -- small and large. Inherit both with two
distinct test cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just move the code. And hard-code the "scale" (i.e. -- number of keys
and iterations) from default arguments of the unit test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This code is going to be shared between seastar thread and boost tests,
temporarily. So not to yield in pure boost test, add the switch. It will
be removed really soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch doesn't yet change how schema merging works but it prepares the ground for it by simplifying the code and separating merging logic into its own unit.
It consists of:
- minor cleanups of unused code
- moving code into separate file
- simplifying merge_keyspaces code
More detailed explanation in per commit messages.
Relates scylladb/scylladb#19153Closesscylladb/scylladb#19687
* github.com:scylladb/scylladb:
db: schema_applier: simplify merge_keyspaces function
db: schema_applier: remove unnecessary read in merge_keyspaces
db: schema_tables: move scylla specific code into create keyspace function
db: move schema merging code into a separate unit
db: schema_tables: export some schema management functions
replica: remove unused table_selector forward declaration
db: remove unused flush arg from do_merge_schema func
db: remove unused read_arg_values function
`unspecified` workload type is an internal value and it's not exposed to
user via CQL.
Default value for workload type from user's perspective is `NULL`.
Fixesscylladb/scylladb#20780
Maintainers use scripts/pull_github_pr.sh from scylladb.git when merging PRs and before pushing to the next. We want to prevent merges from piling up on top of unstable builds. This change will check Gating's current status and notify the maintainers
Related to https://github.com/scylladb/scylla-pkg/issues/3644Closesscylladb/scylladb#20742
Because it is never such -- the only caller of truncate_blocking()
always knows the timeout it want this method to use.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20620
Change order of functions: firstly remount, then change ownership for
cgroup. It was not failing before because with privileged mode, it will
mount cgroups as RW, but it's better to have this check if behavior will
change.
Closesscylladb/scylladb#20676
for better readability.
read_config() is not on the critical path, so the performance
degradation caused by C++20 couroutine is neglectable.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20694
The test configures write timeout to much smaller value to make the test
run faster since for some writes sleep is inserted to hit the timeout,
but it makes aarch64 debug flaky since timeout happens when it should
not because of a natural slowness.
Fixesscylladb/scylladb#20515Closesscylladb/scylladb#20744
This one is aimed at giving tests the ability to call private methods of class sstable. Some of the wrappers in the test class wrap public methods and can be removed.
Closesscylladb/scylladb#20614
* github.com:scylladb/scylladb:
test: Remove sstables::test::binary_search()
test: Remove sstables::test::move_summary()
test: Remove sstables::test::read_toc()
test: Remove sstables::test::get_summary()
test: Remove sstables::test::get_statistics()
test: Remove sstables::test::data_read()
Add for both x86_64 compilation flags for clang, to get it compile with newer arch x86_64-v3 for x86 and ARM 8.2 level for aarch64.
Tested to compile fine with both clang 18.1.6 and 18.1.8.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#20682
This change allows the user to fully set the page size for the query.
There's still an internal hard-limit of 1MB anyway, so there's no need
to limit it to our default value (because using a larger page size might
be a query optimization sometimes)
Fixes#20612Closesscylladb/scylladb#20692
This one is pretty simple
```
return do_with(std::move(data), [] {
toss_data(data);
return remove(std::move(data));
});
```
it doesn't really need to do_with() since "toss_data" is non-preemptive. Still, convert it into
```
toss_data(data);
co_await remove(std::move(data));
```
Closesscylladb/scylladb#20479
* github.com:scylladb/scylladb:
sstables: Restore indentation after previous patch
sstables: Coroutinize remove_unshared_sstables()
to explain for instance which setting takes effect if both
command line options and `scylla.yaml` configures the same parameter.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20696
The sstables header contains a forward declaration for
`random_access_reader`. This was introduced in 75dc7b799e for no
obvious reason.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The sstables header contains a forward declaration for
`metadata_collector`. This was introduced in 2d6608bb88 for the return
value of the `sstable_writer::get_metadata_collector()`. This function
was later removed in 9e7144f719 but the forward declaration was left
behind.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The sstables header contains a forward declaration for
`sstable_writer_v2`. This was introduced in fed5b73147 but never used.
It is probably a leftover from a previous revision of the patchset.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The sstables header contains a forward declaration for `key`.
This was introduced in 198f55dc5c for a reference parameter in
`binary_search()`.
The function was eventually moved to a different header in 4ed7e529db
but the forward declaration was left behind.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Every time the ninja buildfile decides it needs to be updates, it calls
the configure.py script with roughly the same set of flags. However, the
--out-final-name flag is improperly handled and, on each reconfigure,
one more --out-final-name flag is appended to the rebuild command. This
is harmless because each instance of the flag will specify the same
parameter, but slightly annoying because it bloats the generated file
and the duplicated flags show up in ninja's output when reconfigure
runs.
Fix the problem by stripping the --out-final-name flags from the set of
the flags passed to the configure.py before forwarding them to the
reconfigure rule.
Closesscylladb/scylladb#20731
We add new member functions to the interface of `auth::authenticator`
responsible for querying the password hash corresponding to a given
role. One method indicates whether a given authenticator uses
password hashes, while the other queries them or throws an exception
password hashes are not used.
The rationale for extending the interface of authenticator is
to be able to access salted hashes from other parts of auth.
We will need them in an upcoming commit responsible for describing
auth.
in 3cd2a61736, we dropped scylla-jmx
from the build. but didn't update the CMake building system accordingly,
this broke the CMake build, as the dependencies pointing to jmx cannot
be found or fulfilled.
in this change, we remove all references to jmx in the CMake build.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Closesscylladb/scylladb#20736
- removes uneccesary temporary sets/vectors
- removes auto&&
- moves return value instead of copying
- instead adds diff references to keep readability
- create and alter logic is almost the same, now it's visible better
read_schema_partition_for_keyspace() is already called for every
changing keyspace by get_schema_complete_view() and stored in _after
field so we can reuse this data.
Since extract_scylla_specific_keyspace_info() was always coupled
with create_keyspace_from_schema_partition() there is no value
in separating them. By moving first into the latter we:
- reduce number of exported functions
- simplify arguments of create_keyspace_from_schema_partition
- simplify caller's code
It's mostly self containted and it's easier to
maintain reasonably sized files. Also splitting
better shows boundaries between schema and
schema merging code.
In subseqent commits schema merging code will be separated from
db/schema_tables.cc but code which manages schema will remain intact.
So those two translation units will share some amount of code.
It's similar case as with replica/database.cc which creates schema
on startup, it calls functions from db/schema_tables.cc.
Struct qualified_name got moved to header as it's used as
read_table_mutations() argument.
This PR addresses multiple issues with alternator batch metrics:
1. Rename the metrics to scylla_alternator_batch_item_count with op=BatchGetItem/BatchWriteItem
2. The batch size calculation was wrong and didn't count all items in the batch.
3. Add a test to validate that the metrics values increase by the correct value (not just increase). This also requires an addition to the testing to validate ops of different metrics and an exact value change.
Needs backporting to allow the monitoring to use the correct metrics names.
Fixes#20571Closesscylladb/scylladb#20646
* github.com:scylladb/scylladb:
alternator:test_metrics test metrics for batch item count
alternator:test_metrics Add validating the increased value
alternator: Fix item counting in batch operations
Alterntor rename batch item count metrics
This commit addresses an issue where accessing the raw pointer of the
schema instance within `table::_schema` using `table.schema._p` was
unreliable.
before this change, `_p` was of type `lw_shared_ptr_counter_base*`, a
type-erased smart pointer, preventing direct casting to the underlying
schema pointer. but we still cast it to `schema*` anyway. this led to
a gdb.MemoryError when dereferencing the deduced pointer:
but the type of `_p` is `lw_shared_ptr_counter_base*`, which is a
type erased smart pointer, and it cannot be casted directly to the
under pointer pointing to a `schema` instance. this results in:
```
Traceback (most recent call last):
File "/home/avi/scylla/test/scylla_gdb/../../scylla-gdb.py", line 5554, in invoke
self.print_key_type(seastar_lw_shared_ptr(schema['_clustering_key_type']).get().dereference(), 'clustering')
File "/home/avi/scylla/test/scylla_gdb/../../scylla-gdb.py", line 5533, in print_key_type
key_type = seastar_shared_ptr(key_type).get().dereference()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gdb.MemoryError: Cannot access memory at address 0x4000079656b0078
```
when we are dereferencing the raw pointer deduced this way.
in this change,
* we use the wrapper of `seastar_lw_shared_ptr` to safely obtain the
raw pointer.
* reenable this test previously disabled by 3d781c4f
tested using
```console
$ SCYLLA=/home/kefu/dev/scylladb/master/build/release/scylla \
test/scylla_gdb/run -o junit_suite_name=scylla_gdb test_misc.py::test_schema
```
on an up-to-date fedora 40 installation.
Refs 3d781c4fFixesscylladb/scylladb#20741
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20746
Move all of the blatantly restriction-related expression utilities
to statement_restrictions.cc.
Some are so blatant as to include the word "restriction" in their name.
Others are just so specialized that they cannot be used for anything else.
The motivation is that further refactoring will be simplified if it can
happen within the same module, as there will not be a need to prove
it has no effect elsewhere.
Most of the declarations are made non-public (in .cc file) to limit
proliferation. A few are needed for tests or in select_statement.cc
and so are kept public.
Other than that, the only changes are namespace qualifications and
removal of a now-duplicate definition ("inclusive").
Closesscylladb/scylladb#20732
* tools/java e505a6d3bb...5b0e274f12 (1):
> Merge 'build.xml: install and use java-11 when building' from Kefu Chai
Updates to clang 18.1.8 + LLVM patch to match Fedora 40.
New optimized clang build generated and stored in
https://devpkg.scylladb.com/clang/clang-18.1.8-x86_64.tar.gzhttps://devpkg.scylladb.com/clang/clang-18.1.8-aarch64.tar.gz
Due to the loss of the jmx submodule, we no longer install java-11-openjdk.
We add it in install-dependencies.sh here to compensate, pending a better
solution.
tools/java submodule updated to remove build failure where Java 8
was selected instead of Java 11.
The scylla_gdb test suite was disabled due to a regression in gdb 15,
which is brought in by the toolchain update [1].
[1] https://github.com/scylladb/scylladb/issues/20741.
To fix a race between split and repair here c1de4859d8, a new sstable
generated during streaming can be split before being attached to the sstable
set. That's to prevent an unsplit sstable from reaching the set after the
tablet map is resized.
So we can think this split is an extension of the sstable writer. A failure
during split means the new sstable won't be added. Also, the duration of split
is also adding to the time erm is held. For example, repair writer will only
release its erm once the split sstable is added into the set.
This single-sstable split is going through run_custom_job(), which serializes
with other maintenance tasks. That was a terrible decision, since the split may
have to wait for ongoing maintenance task to finish, which means holding erm
for longer. Additionally, if split monitor decides to run split on the entire
compaction group, it can cause single-sstable split to be aborted since the
former wants to select all sstables, propagating a failure to the streaming
writer.
That results in new sstable being leaked and may cause problems on restart,
since the underlying tablet may have moved elsewhere or multiple splits may
have happened. We have some fragility today in cleaning up leaked sstables on
streaming failure, but this single-sstable split made it worse since the
failure can happen during normal operation, when there's e.g. no I/O error.
It makes sense to kill run_custom_job() usage, since the single-sstable split
is offline and an extension of sstable writing, therefore it makes no sense to
serialize with maintenance tasks. It must also inherit the sched group of the
process writing the new sstable. The inheritance happens today, but is fragile.
Fixes#20626.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
let's assume there are 2 nodes, n1, n2. n1 is the coordinator.
1) n1 emits split
2) n1 and n2 complete split work
3) n1 becomes aware all replicas are ready for split
4) n2 restarts, but places split sstable into main group[1]
5) n1 executes split
6) n2 handles split completion, but see the main group is not empty
[1]: During split, main group should only contain unsplit sstables.
If all sstables are split, main must be empty.
This is a result of replica not setting storage group to split mode on restart
(using tablet map) and therefore sstables are incorrectly placed on main group.
The fix is about looking at tablet map and setting group to split mode before
sstables are populated into it.
Refs #20626.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Old nodetool requested `/storage_service/tokens_endpoing` first, then
`/storage_service/host_id`, while the native nodetool did it in reverse
order. Most of the time this is inconsequential but there is an edge
case when a node's IP address is changed. This reversing of the order
results in unexpected behavior for tests, causing noise via flaky
tests.
Match the order of the old nodetool so that the native nodetool exhibits
the behavior expected by tests (and users too probably).
Fixes: scylladb/scylladb#18693Closesscylladb/scylladb#20615
The interface is not used anywhere anymore, so we can
remove it safely. It has been replaced by custom
functions for each keyspace element and `cql3::description`.
We continue removing `data_dictionary::keyspace_element`.
In this commit, we start using the overloads returning
`cql3::description` in places where the methods specified
by `data_dictionary::keyspace_element` were used.
We add a new parameter in functions used to generate instances
of `cql3::description` for types related to situations where we
might not need a create statement. An example of such a scenario
could be `DESCRIBE TYPES`.
We're removing `data_dictionary::keyspace_element`.
Before we can do that, we need to substitute the existing
methods used for describing keyspace elements with their
new versions returning `cql3::description`.
That's what happens in this commit.
We're going to remove the interface `data_dictionary::keyspace_element`.
As `schema::keyspace_name()` is an implementation of one of the methods
specified by that interface, we replace its uses by `schema::ks_name()`.
`schema::keyspace_name()` was an alias for it, so no semantic change
has occured.
In these changes, we describe the purpose of the type
and make it reusable for other parts of the code.
That includes ditching the existing constructors,
leaving the formatting of its fields to the user
of the interface.
The removed constructors have been replaced by
free functions so that existing code can still
use them the way it did before.
We move the declaration of `description` to dedicated files
to be able to create instances of it from other parts of
the code.
`describe_statement.cc` has been functioning as an
intermediary between objects that can be described
and the end user. It will still perform that duty,
but we want to let other modules be able to generate
descriptions on their own, without having to share
an additional layer of abstraction in form of types
inheriting from `data_dictionary::keyspace_element`.
Those types may not perform any other function than
that and thus may be redundant.
Adjusting `description` to its new purpose will happen
in an upcoming commit.
We start requiring that the user issuing `CREATE ROLE
WITH SALTED HASH` be a superuser. The rationale for
that is the statement directly modifies a system
tables, circumventing the hashing algorithm.
Additionally, we correct a possible existing problem.
`_options.is_superuser` in `create_role_statement`
may be an empty optional, so dereferencing it
without a prior check could lead to undefined
behavior in the future.
We introduce a way to create a role with explictly
provided salted hash.
The algorithm for creating a role with a password works
like this:
1. The user issues a statement `CREATE ROLE <role> WITH
PASSWORD = '<password>' <...>`.
2. Scylla produces a hash based on the value of
`<password>`.
3. Scylla puts the produced hash in `system.roles`,
in the column `salted_hash`.
The newly introduced way to create a role is based
on a new form of the create statement:
`CREATE ROLE <role> WITH SALTED HASH = '<salted_hash>`
The difference in the algorithm used for processing
this statement is that we insert `<salted_hash>`
into `system.roles` directly, without hashing it.
The rationale for introducing this new statement is that
we want to be able to restore roles. The original password
isn't stored anywhere in the database (as intended),
so we need to rely on the column `salted_hash`.
The introduced function returns the actual name
of the type represented by `abstract_type`.
It circumvents name processing like wrapping a type
within `frozen<>` or using Cassandra's syntax.
We add the function to be able to describe UDFs
in the upcoming commits that require that their
arguments not be `frozen<>`.
We also test the implementation.
This reverts commit 44bd183187 and
moves the base directory back on filesystem_storage. The mentioned
commit says
> so we can use the base (table) directory for
> e.g. pending_delete logs, in the next patch.
but "next patch" doesn't use it outside of the filesystem-storage
anyway.
This field doesn't make sense for S3 backend. Its "location" is not
location, but a key in the system.sstables, which should rather be
schema ID, not /var/lib/.../keyspace/table-uuid string.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20642
For the benefit of running test.py inside CI, we recently added to
test/cql-pytest and test/alternator the knowledge of which "Scylla mode"
(--mode) and "run number" is running (--run_id), although these concepts
are alien to these two test frameworks (remember that those test frameworks
can also run tests against unknown versions of Scylla or even our competitors'
implementations).
One unfortunate result of this change is that now if you run a test by
using pytest directly (or test/*/run) instead of test.py, for example:
$ cd test/alternator
$ pytest --aws test_item.py::test_basic_string_put_and_get
The test's success or failure reports the ugly name
test_item.py::test_basic_string_put_and_get.no_mode.1
This unnecessary "no_mode.1" come from the the default values for --mode
and --run_id, respectively. But there is no reason for these silly
defaults. In this patch we change these defaults to None, and when they
are None, they aren't tacked onto the test's name.
This patch shouldn't affect running tests through test.py, because
test.py always sets the --mode and --run_id options, and doesn't leave
them as the default.
Fixes#20512
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20513
Drop the unused `gms::inet_address::raw_addr` method
and modernize operator== and operator< as class methods
* Cleanup only, no backport needed
Closesscylladb/scylladb#20681
* github.com:scylladb/scylladb:
gms: inet_address: modernize comparison operators
gms: inet_address: drop unused raw_addr method
Allow to specify service level used in select statement `SELECT ... USING SERVICE LEVEL sl_name`.
In OSS, this only affects statement's timeout.
In case both service level and timeout are specified `SELECT ... USING SERVICE LEVEL sl_name AND TIMEOUT 1h`, the timeout has higher priority as statement's timeout.
Fixesscylladb/scylladb#18471Closesscylladb/scylladb#20523
* github.com:scylladb/scylladb:
test/cql-pytest: add test for `SELECT ... USING SERVICE LEVEL`
cql3/Cql.g: extend grammar to allow `SELECT ... USING SERVICE LEVEL`
cql3/statements/select_statement: use service level timeout
cql3/attributes: add service level name field
qos/service_level_controller: add method to check if service level exists in cache
The statement_restrictions class started life in the object-oriented style - an
object that interacts with its environment via mutators and is observed via
observers.
This is however not suitable for its objective: to analyze the WHERE clause,
select a query plan, and partition the WHERE clause atoms to the various
parts demanded by the query plan (read_command and filters). Furthermore,
the object oriented style makes it hard to work with as you can only call some
observers after the related mutators were called.
Fix this by transforming the code info a more functional style: we call
a function that returns an immutable statement_restrictions object that
can only be observed. This makes it easier to further change in the future,
as changes will not have to consider interaction with the environment.
No backport as this is a refactoring
Closesscylladb/scylladb#20672
* github.com:scylladb/scylladb:
cql3: statement_restrictions: use functional style
cql3: statement_restrictions: calculate the index only once
cql3: statement_restrictions: make it a const object
use `seastar::handle_signal()` instead of `reactor::handle_signal()`.
in a recent change in seastar (c3e826ad1197f2610138f3bcfaeb0b458f8fb799),
the later was marked as deprecated in favor of the former, so let's
use the recommended API.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20695
instead grouping tests with different parameters, let's parameterize
them using `BOOST_DATA_TEST_CASE()`, simpler this way. and the tests
can be more structured.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20697
During a query execution, the query can be re-bounced to another shard if the requested data is located there. Previous implementation assumed that the shard cannot be changed after first re-bounce, however with the introduction of Tablets, data could be migrated to another shard after the query was already re-bounced, causing a failure of the query execution. To avoid this issue, the query is re-bounced as needed until it is executed on the correct shard.
Fixes#15465Closesscylladb/scylladb#20493
* github.com:scylladb/scylladb:
cql_server: Add a test for multiple query msg rebounces.
cql_server::connection: process: rebounce msg if needed
cql_server::connection: process: co-routinize connection::process_on_shard
cql_server: connection: process: fixup indentation
cql_server: connection: process_on_shard: drop permit parameter
transport: server: pass bounce_to_shard as foreign shared ptr
cql_server: connection: process: add template concept for process_fn
cql_server: move process_fn_return_type to class definition
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.
During replace with the same IP a node may get queries that were intended
for the node it was replacing since the new node declares itself UP
before it advertises that it is a replacement. But after the node
starts replacing procedure the old node is marked as "being replaced"
and queries no longer sent there. It is important to do so before the
new node start to get raft snapshot since the snapshot application is
not atomic and queries that run parallel with it may see partial state
and fail in weird ways. Queries that are sent before that will fail
because schema is empty, so they will not find any tables in the first
place. The is pre-existing and not addressed by this patch.
Until we automatically support rebuild for tablets-enabled
keyspaces, warn the user about them.
The reason this is not an error, is that after
increasing RF in a new datacenter, the current procedure
is to run `nodetool rebuild` on all nodes in that dc
to rebuild the new vnode replicas.
This is not required for tablets, since the additional
replicas are rebuilt automatically as part of ALTER KS.
However, `nodetool rebuild` is also run after local
data loss (e.g. due to corruption and removal of sstables).
In this case, rebuild is not supported for tablets-enabled
keyspaces, as tablet replicas that had lost data may have
already been migrated to other nodes, and rebuilding the
requested node will not know about it.
It is advised to repair all nodes in the datacenter instead.
Refs scylladb/scylladb#17575
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#20375
It's write-only now, all the places than wanted to know where table's
storage is, already use storage_options.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When populating keyspace on boot the dist. loader prints a debugging
message with ks:cf names, state and the directory from where it picks
sstables. The last one is not extremely correct, as loading sstables
from S3 happens from a bucket, not directory. So it's better to print
the storage options, not the datadir string.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only place that constructs table_for_tests is make_table_for_tests
helper. It can and should prepare the correct storage options, because
that's the last place where the target directory is still known.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
They only differ in a way they get target directory from -- one via
argument, andother from test_env. Respectively, the latter can call the
former.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a bunch of tests that check the contents of snapshot directory
after creating one. Add a helper for those that gets this directory via
storage options, not table config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to snapshot_on_all_shards, the way snapshot directory is
evaluated is changed to rely on storage options. Two ... assumptions are
that when asking for non-local snapshot existance or for a snapshot of a
virtual table, it's correct to return false instead of throwing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several things that are changed here
- The target directory for snapshot is evaluated using table directory
taken from its storage options, not from config
- If the storage options are not "local", the snapshot_on_all_shards is
failed early, it's impossible to snapshot sstables anyway
- If the storage is not configured for the obtained local options,
snapshotting is skilled, because it's a virtual table that's probably
not supposed to have snapshots
- The late failure to snapshot non-local sstables is converted into
internal error, as this functionality cannot be executed as per
previous change
- The target path is created using fs::path operator/ overload, not by
concatenating strings (it's minor change)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Refs #20686
Refs #15607
In #15060 we added forced new commitlog segment on user initated flush,
mainly so that tests can verify tombstone gc and other compaction related
things, without having to wait for "organic" segment deletion.
Schema commitlog was not included, mainly because we did not have tests
featuring compaction checks of schema related tables, but also because
it was assumed to be lower general througput.
There is however no real reason to not include it, and it will make some
testing much quicker and more predictable.
Closesscylladb/scylladb#20691
* Also dump diagnostics when a read times out while active (not queued).
* Add the "Trigger permit" line, containing the details of the permit which caused the diagnostics dump (by e.g. timing out).
* Add the "Identified bottleneck(s)" line, containing the identified bottlenecks which lead to permits being queued. This line is missing if no such bottleneck can be identified.
* Document the new features, as well as the stat dump, which was added some time ago.
Example of the new dump format:
```
INFO 2024-09-12 08:09:48,046 [shard 0:main] reader_concurrency_semaphore - Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/10 count and 106192275/32768 memory resources: timed out, dumping permit diagnostics:
Trigger permit: count=0, memory=0, table=ks.tbl0, operation=mutation-query, state=waiting_for_admission
Identified bottleneck(s): memory
permits count memory table/operation/state
3 2 26M *.*/push-view-updates-2/active
3 2 16M ks.tbl1/push-view-updates-1/active
1 1 15M ks.tbl2/push-view-updates-1/active
1 0 13M ks.tbl1/multishard-mutation-query/active
1 0 12M ks.tbl0/push-view-updates-1/active
1 1 10M ks.tbl3/push-view-updates-2/active
1 1 6060K ks.tbl3/multishard-mutation-query/active
2 1 1930K ks.tbl0/push-view-updates-2/active
1 0 1216K ks.tbl0/multishard-mutation-query/active
6 0 0B ks.tbl1/shard-reader/waiting_for_admission
3 0 0B *.*/data-query/waiting_for_admission
9 0 0B ks.tbl0/mutation-query/waiting_for_admission
2 0 0B ks.tbl2/shard-reader/waiting_for_admission
4 0 0B ks.tbl0/shard-reader/waiting_for_admission
9 0 0B ks.tbl0/data-query/waiting_for_admission
7 0 0B ks.tbl3/mutation-query/waiting_for_admission
5 0 0B ks.tbl1/mutation-query/waiting_for_admission
2 0 0B ks.tbl2/mutation-query/waiting_for_admission
8 0 0B ks.tbl1/data-query/waiting_for_admission
1 0 0B *.*/mutation-query/waiting_for_admission
26 0 0B permits omitted for brevity
96 8 101M total
Stats:
permit_based_evictions: 0
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 0
total_failed_reads: 0
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 1
reads_enqueued_for_admission: 82
reads_enqueued_for_memory: 0
reads_admitted_immediately: 1
reads_queued_because_ready_list: 0
reads_queued_because_need_cpu_permits: 82
reads_queued_because_memory_resources: 0
reads_queued_because_count_resources: 0
reads_queued_with_eviction: 0
total_permits: 97
current_permits: 96
need_cpu_permits: 0
awaits_permits: 0
disk_reads: 0
sstables_read: 0
```
Fixes: https://github.com/scylladb/scylladb/issues/19535
Improvement, no backport needed.
Closesscylladb/scylladb#20545
* github.com:scylladb/scylladb:
docs/dev/reader-concurrency-semaphore.md: update the documentation on diagnostics dumps
test/boost/reader_concurrency_semaphore_test: test the new diagnostics functionality
reader_concurrency_semaphore: add bottleneck self-diagnosis to diagnosis dump
reader_concurrency_semaphore: include trigger permit in diagnostic dump
reader_concurrency_semaphore: propagate permit to do_dump_reader_permit_diagnostics()
reader_concurrency_semaphore: use consistent exception type for timeout
reader_concurrency_semaphore: dump diagnostics when non-waiting reader times out
So the table is not dropped while the query is ongoing.
query() already does this but using old-fashioned enter()+leave(),
convert it to use the new RAII helper.
Closesscylladb/scylladb#20583
The main goal of this PR is to fix a bug (#20619) in the alternator_enforce_authorization=false setting - which didn't do its job (i.e, _don't_ check permissions) when authorization is configured in CQL but not wanted in Alternator.
The series also a few smaller bugs in the code that were discovered while debugging the main issue:
1. A potential use-after-free (that didn't seem to hit us in practice) is fixed.
2. A confusing error message (that was also reported in #20619) is improved.
3. Make the alternator_enforce_authorization live-updatable. There was no reason why it shouldn't be, and as this series needs to make this flag available to more code, let's just do it properly and assume the flag is live-updatable.
Because the RBAC feature has not been backported to any open-source branches, neither should these fixes. But if some private branch received a backport of the RBAC feature, it should get these fixes too.
Fixes#20619.
Closesscylladb/scylladb#20640
* github.com:scylladb/scylladb:
alternator: make alternator_enforce_authorization live-updateable
alternator: fix alternator_enforce_authorization=false
alternator: improve error message when unauthenticated
alternator: avoid use-after-free in RBAC
* seastar ec5da7a6...69f88e2f (38):
> build: s/Sanitizers_COMPILER_OPTIONS/Sanitizers_COMPILE_OPTIONS
> test: Update httpd test with request/reply body writing sugar
> http: Add sugar to request and response body writers
> utils: Add util::write_to_stream() helper
> seastar-addr2line: adjust llvm termination regex
> README.md: add Crimson project
> rpc: conditionally use fmt::runtime() based on SEASTAR_LOGGER_COMPILE_TIME_FMT
> build: check the combination of Sanitizers
> tls: clear session ticket before releasing
> print: remove dead code
> doc/lambda-coroutine-fiasco: reword for better readability
> rpc: fix compilation error caused by fmt::runtime()
> tutorial: explain the use case of rethrow_exception and coroutine::exception
> reactor: print more informative error when io_submit fails
> README.md: note GitHub discussions
> prometheus: `fmt::print` to stringstream directly
> doc: add document for testing with seastar
> seastar/testing: only include used headers
> test: Add abortable http client test cases
> http/client: Add abortable make_request() API method
> http/client: Abort established connections
> http/client: Handle abort source in pool wait
> http/client: Add abort source to factory::make() method
> http/client: Pass abort_source here and there
> http/client: Idnentation fix after previous patch
> http/client: Merge some continuations explicitly
> signal: add seastar signal api
> httpd: remove unused prometheus structs
> print: use fmtlib's fmt::format_string in format()
> rpc: do not use seastar::format() in rpc logger
> treewide: s/format/seastar::format/
> prometheus: sanitize label value for text protocol
> tests: unit test prometheus wire format
> io-tester: Introduce batches to rate-based submission
> io-tester: Generalize issueing request and collecting its result
> io-tester: Cancel intent once
> io-tester: Dont carry rps/parallelism variables over lambdas
> io-tester: Simplify in-flight management
The breaking changes in the seastar submodule necessitate corresponding
modifications in our code. These changes must be implemented together in
a single commit to maintain consistency. So that each commit is buildable.
following changes are included in addition to seastar submodule update:
* instead of passing a `const char*` for the format string, pass a
templated `fmt::format_string<...>`, this depends on the
`seastar::format()` change in seastar.
* explicitly call `fmt::runtime()` if the format string is not a
consteval expression. this depends on the `seastar::format()` change
in seastar. as `seastar::format()` does not accept a plain
`const char*` which is not constexpr anymore.
* pass abort_source to `dns_connection_factory::make()`. this depends on
the change in seastar, which added a `abort_source*` argument to
the pure virtual member function of `connection_factory::make()`.
* call call {fmt,seastar}::format() explicitly. this is a follow up of
3e84d43f, which takes care of all places where we should call
`fmt::format()` and `seastar::format()` explicitly to disambiguate the
`format()` call. but more `format()` call made their way into the source
tree after 3e84d43f. so we need fix them as well.
* include used header in tests
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Update seastar submodule
Please enter the commit message for your changes. Lines starting
Closesscylladb/scylladb#20649
ID->IP mapping is added to the raft address map when the mapping first
appears in the gossiper, but it is added as expiring entry. It becomes
non expiring when a node is added to raft configuration. But when a node
joins those two events may be distant in time (since the node's request
may sit in the topology coordinator queue for a while) and mappings may
expire already from the map. This patch makes sure to transfer the
mapping from the gossiper for a node that is added to the raft
configuration instead of assuming that the mapping is already there.
This patch adds tests for the batch operations item count.
The tests validate that the metrics tracking the number of items
processed in a batch increase by the correct amount.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The `check_increases_operation` now allows override the checked metric.
Additionally, a custom validation value can now be passed, which make it
possible to validate the amount by which a value has changed, rather
than just validating that the value increased.
The default behavior of validating that values have increased remains
unchanged, ensuring backward compatibility.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch fixes the logic for counting items in batch operations.
Previously, the item count in requests was inaccurate, it count the
number of tabels in get_item and the request_items in write_items.
The new logic correctly counts each individual item in `BatchGetItem`
and `BatchWriteItem` requests.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch renames metrics tracking the total number of items in a batch
to `scylla_alternator_batch_item_count`. It uses the existing `op` label to
differentiate between `BatchGetItem` and `BatchWriteItem` operations.
Ensures better clarity and distinction for batch operations in monitoring.
This an example of how it looks like:
# HELP scylla_alternator_batch_item_count The total number of items processed across all batches
# TYPE scylla_alternator_batch_item_count counter
scylla_alternator_batch_item_count{op="BatchGetItem",shard="0"} 4
scylla_alternator_batch_item_count{op="BatchWriteItem",shard="0"} 4
branch-6.2 is already available, adding support for it in mergify to
allow backport to this new branch.
in addition, since branch 5.4 reached EOL - removing it
Closesscylladb/scylladb#20669
Fixes#20633
Cannot assert on actual request_controller when releasing permit, as the
release, if we have waiters in queue, will subtract some units to hand to them.
Instead assert on permit size + waiter status (and if zero, also controller value)
* v2 - use SCYLLA_ASSERT
Closesscylladb/scylladb#20654
repair_put_row_diff_with_rpc_stream_process_op() always returns
stop_iteration::no (or throws). Moreover, the return value is ignored
by its only caller. Simplify by returning a plain future<>.
Closesscylladb/scylladb#20610
Most of the analysis of the WHERE clause is done in statement_restrictions. It determines
what parts to use for the primary or secondary index, and what parts to use for filtering.
The difficult part is that it has a very wide interface. After construction, the user must pick
the correct bits from many public functions. There are subtle interactions between them
that are hard to untangle.
This series simplifies the interface as it is used for selection filtering. In the end, only
two public functions are used, both returning expressions: one for the partition-level
filtering, one for the clustering row level filtering.
In the end, the WHERE clause is factored into three parts:
- one part goes into the read_command of the primary or secondary index
- another part (that references only partition key columns and static key columns) is used to filter entire partitions
- another part (that currently references only clustering key columns and regular columns, but one day may reference other columns) is used to filter clustering rows
Refactoring, no backport.
Closesscylladb/scylladb#20487
* github.com:scylladb/scylladb:
cql3: statement_restrictions: drop accessors for single-column key restrictions
cql3: selection: adjust indentation
cql3: selection: delete empty loop
cql3: statement_restrictions, selection: fold multi-column restrictions into row-level filter
cql3: statement_restrictions, selection: merge clustering key filter and regular columns filter
cql3: statement_restrictions, selection: merge partition key filter and static columns filter
cql3: selection: filter regular and static rows as a single expression each
cql3: statement_restrictions: collect regular column and static column filters into single expressions
cql3: selection: filter clustering key as a single expression
cql3: statement_restrictions: expose filter for clustering key
cql3: selection: filter partition key as a single expression
cql3: statement_restrictions: expose filter for partition key
cql3: statement_restrictions: remove relations used for indexing from filtering
cql3: statement_restrictions: bail out of find_idx if !_uses_secondary_index
cql3: statement_restrictions, modification_statement: pass correct value of check_indexes
cql3: statement_restrictions: correct mismatched clustering/partition restrictions references
cql3: statement_restrictions: precalculate get_column_defs_for_filtering()
cql3: selection: do_filter(): push static/regular row glue to higher level
In https://github.com/scylladb/scylladb/pull/18729, we introduced a new statement tenant `$maintenance`, but the change wasn't protected by any cluster feature.
This wasn't a problem for OSS, since unknown isolation cookie just uses default scheduling group. However, in enterprise that leads to creating a service level on not-upgraded nodes, which may end up in an error if user create maximum number of service levels.
This patch adds a cluster feature to guard adding the new tenant. It's done in the way to handle two upgrade scenarios:
- version without `$maintenance` tenant -> version with `$maintenance` tenant guarded by a feature
- version with `$maintenance` tenant but not guarded by a feature -> version with `$maintenance` tenant guarded by a feature
The PR adds `enabled` flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create a connection, but it can be used to accept an incoming connection.
The `$maintenance` tenant is added to the config as disabled and it gets enabled once the corresponding feature is enabled.
Fixesscylladb/scylladb#20070
Refs scylladb/scylla-enterprise#4403Closesscylladb/scylladb#19802
* github.com:scylladb/scylladb:
message/messaging_service: guard adding maintenance tenant under cluster feature
message/messaging_service: add feature_service dependency
message/messaging_service: add `enabled` flag to statement tenants
Instead of a constructor, use a new function
analyze_statement_restrictions() as the entry point. It returns an
immutable statement_restrictions object.
This opens the door to returning a variant, with each arm of the variant
corresponding to a different query plan.
find_idx() is called several times. Rename it do_find_idx(), call it
just once, store the results, and make find_idx() return the stored
results.
This simplifies control flow and reduces the risk that successive
calls of find_idx return different results.
Make validate_secondary_index_selections() const (it trivially is),
and call prepare_indexed_local() / prepared_indexed_global() at the
end of the constructor.
By making statement_restrictions a const object, reasoning about it
can be local (looking at the source file) rather than global (looking
at all the interactions of the class with its environment. In fact,
we might make it a function one day.
Since prepare_indexed_global()/prepare_indexed_local() only mutate
_idx_tbl_ck_prefix, which isn't mutated by the rest of the code, the
transformation is safe.
The corresponding code is removed from select_statement. The removal
isn't complete since it still uses some computation, but later
deduplication is left for another day.
The test emulates several LWT(Lightweight Transaction) query rebounces. Currently, the code
that processes queries does not expect that a query may be rebounced more than once.
It was impossible with the VNodes, but with intruduction of the Tablets, data can be moved
between shards by the balancer thus a query can be rebounced to different shards multiple times.
Rebounce the msg to another shard if needed,
e.g. in the case of tablet migration.
An example for that, as given by Tomasz Grabiec:
> Bouncing happens when executing LWT statement in
> modification_statement::execute_with_condition by returning a
> special result message kind. The code assumes that after
> jumping to the shard from the bounce request, the result
> message is the regular one and not yet another bounce.
> There is no problem with vnodes, because shards don't change.
> With tablets, they can change at run time on migration.
Fixesscylladb/scylladb#15465
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`cql_server::connection::process_on_shard` is made a co-routine to
make sure captured objects' lifetime is managed by the source shard,
avoiding error prone inter-shard objects transfers.
For no good reason, the "alternator_enforce_authorization" flag (which
chooses whether to enable authentication and authorization checks in
Alternator) was not live-updatable, so make it so.
Both "server" and "executor" objects use this configuration flag, the
former is fixed in this patch (to hold a live-updatable reference
instead of a copy of a boolean), the latter was already prepared for
this change and already held a live-updatable reference.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When the configuration has alternator_enforce_authorization=false,
Alternator should not do authentication (check which user signed each
request) nor authorization (check if that user has permissions to do
each operation).
Our implementation forgot to disable the authorization checks when
it's configured to false. The (incorrect) assumption was that when
alternator_enforce_authorization is configured to false, the CQL
'authenticator' and 'authorizer' configuration is also disabled -
so the authorization checks will be no-ops. But we can't assume
that: Users are free to configure 'authenticator' and 'authorizer'
for use in CQL, and then set alternator_enforce_authorization=false
just for Alternator.
So this patch adds a new test for this case - when we have
authenticator=PasswordAuthenticator, authorizer=CassandraAuthorizer
but alternator_enforce_authorization=false, and fixes it to work
correctly.
The heart of the fix is trivial: the `verify_*_permission()` functions
just need to check the alternator_enforce_authorization and return
immediately when false. The bigger part of this change is to get the
alternator_enforce_authorization into the "executor" object and then
to pass it into the verify calls.
Although alternator_enforce_authorization is not YET live updatable,
this code is prepared for the future that it may become live
updatable, so the executor object saves not the boolean value of
this flag, but a live-updatable reference to it.
Fixes#20619
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When access-control checks report permission denied, we want to report
the name of the authenticated role (the role signing the request) which
didn't have the permission. When authentication was disabled, and there
is no authenticated role, we printed the fake name "anonymous", but this
can confuse users (it confused me!) to think there's an actual role
named "anonymous". So let's change that string to "<anonymous>" with
angle brackets - it makes it more obvious that this isn't a real role,
but actually an anonymous request.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
While auditing the code, I noticed that the current Alternator access
control checks have code like:
```
return client_state.check_has_permission(auth::command_desc(
permission_to_check,
auth::make_data_resource(schema->ks_name(), schema->cf_name()))).then(
```
There's a problem here - it turns out that, unfortunately, command_desc
holds a reference to the "resource" object - not a copy. So the temporary
object returned by make_data_resource may be freed and then used...
Curiously, we've not seen a bug caused by this in practice (not even in
debug build mode), but better safe than sorry, so this patch changes the
code in one of two ways:
1. Code using coroutines can keep the "resource" as a variable on the
stack.
2. Code using continuations needs to hold the "resource" with do_with(),
but since this already incurs the cost of an extra allocation
(even in the successful case), might as well just switch to using
coroutines and have less ugly code.
This patch does not change any functionality, and all the tests seem to
work before and after it the same.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
hello
This PR adds the possibility to gather resource consumption metrics. The collected metrics can be used to compare performance before and after specific changes aimed at increasing performance. Currently, this functionality works only in manual mode, and this is just raw data. Later on, these metrics can be used in Jupyter notebook to analyze and visualize how the resources are used and can provide the insight on how to improve it. This PR is a first insight after gathering these metrics.
Add the possibility to gather resource consumption for the test.py execution. SQLite DB will be created with different performance metrics that will allow comparing the resource consumption between changes.
The DB will be in the tmp directory that by default set to testlog. Across the runs, the DB will not be deleted, so each new run will just add information to the existing DB.
Parameter --get-metrics was added to switch on or off the metrics gathering. By default, it's switched on.
Closes: scylladb/qa-tasks#1666Closes: scylladb/qa-tasks#1707Closesscylladb/scylladb#19881
Currently the function calls boost::partial_sort with a middle
iterator that might be out of bound and cause undefined behavior.
Check the vector size, and do a partial sort only if its longer
than `max_sstables`, otherwise sort the whole vector.
Fixesscylladb/scylladb#20608
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#20609
The `consume*()` variants just forward the call to the `_impl` method with the same name. The latter, being a member of `::impl`, will bypass the top level `fill_buffer()`, etc. methods and thus will never call `set_close_required()`. Do this in the top-level `consume*()` methods instead, to ensure a reader, on which only `consume*()` is called, and then is destroyed, will complain as it should (and abort).
Only one place was found in core code, which didn't close the reader: `split_mutation() in `mutation/mutation.cc` and this reader is the "from-mutation" one which has no real close routine. All other places were in tests. All this is to say, there were no real bugs uncovered by this PR.
Fixes#16520
Improvement, no backport required.
Closesscylladb/scylladb#16522
* github.com:scylladb/scylladb:
readers/flat_mutation_reader_v2: call set_close_required() from consume*()
test/boost/sstable_compaction_test: close reader after use
test/boost/repair_test: close reader after use
mutation/mutation: split_mutation(): close reader after use
"crawling" is a little bit obscure in this context. so let's rename this class to reflect the fact that this reader only reads the entire content of the sstable.
both crawling reader for kl and mx formats are renamed. also, in order to be consistent, all "crawling reader" in variable names are updated as well.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#20599
* github.com:scylladb/scylladb:
sstable: s/crawling_sstable_mutation_reader/sstable_full_scan_reader
sstable/mx/reader: add comment for mx_crawling_sstable_mutation_reader
Requests sent by S3 are retriable, so when request.write_body() is
called, it should keep everything intact in case http client will call
it again.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20579
New sstables for a table are created by the table::make_sstable() method. The method then calls sstables_manager::make_sstable() and passes there a path to component files which, in turn, sits on table::config. Since some time ago having an on-disk path for an sstable had become optional, as sstables could be put on S3 storage without local paths involved. In that case the aforementioned "path" is ~~ab~~used as a key in the system.sstables registry, that references a record with information used to retrieve URLs of sstables' objects.
This PR removes the "path" argument from sstables_manager::make_sstable() and its sstable_sdirectory peer. The details of sstables' location are moved onto storage_options and depend on storage type. For now in both storage types this location is still the good-old $datadir/$keyspace/$table-$uuid string. S3 storage needs to be patched more to use more elegant "location" value.
Eventually the `table::config::{datadir|all_datadirs}` will be removed, this PR is the step towards it.
closes: #12707Closesscylladb/scylladb#20542
* github.com:scylladb/scylladb:
table: Use storage options to clean the storage
sstables/storage: Re-use ocally generated vector of paths
sstables/storage: Visit options once to initialize storage
sstables_manager: Return table storage options when initalizing storage
sstables/storage: Fix indentation after previous patch
table: Move datadirs initialization parallelism to storage level
sstables/storage: Split the visitor's overloaded functor
restore: Don't use table_dir to construct sstable_directory
sstable_directory: Remove table_dir field
sstable_directory: Use options details in lister
sstables_manager: Remove table_dir from make_sstable()
sstables: Remove table_dir from sstable constructor
sstables/storage: Remove sstring dir from make_storage()
sstables/storage: Use options to construct
tests: Properly initialize storage options with "dir"
distributed_loader: Create S3 options with prefix for restore
storage_options: Add special-purpose local options maker
storage_options: Keep local path / s3 prefix onboard
table: Get another options when initializing storage
Allow create_pending_deletion_log to delete a bunch of sstables
potentially resides in different prefixes (e.g. in the base directory
and under staging/).
The motivation arises from table::cleanup_tablet that calls compaction_group::cleanup on all cg:s via cleanup_compaction_groups. Cleanup, in turn, calls delete_sstables_atomically on all sstables in the compaction_group, in all states, including the normal state as well as staging - hence the requirement to support deleting sstables in different sub-directories.
Also, apparently truncate calls delete_atomically for all sstables too, via table::discard_sstables, so if it happened to be executed during view update generation, i.e. when there are sstables in staging, it should hit the assertion failure reported in https://github.com/scylladb/scylladb/issues/18862 as well (although I haven't seen it yet, but I see no reason why it would happen). So the issue was apparently present since the initial implementation of the pending_delete_log. It's just that with tablet migration it is more likely to be hit.
Fixesscylladb/scylladb#18862
Needs backport to 6.0 since tablets require this capability
Closesscylladb/scylladb#19555
* github.com:scylladb/scylladb:
sstable_directory: create_pending_deletion_log: place pending_delete log under the base directory
sstables: storage: keep base directory in base class
sstables: storage: define opened_directory in header file
sstable_directory: use only dirlog
"crawling" is a little bit obscure in this context. so let's rename this
class to reflect the fact that this reader only reads the entire content
of the sstable.
both crawling reader for kl and mx formats are renamed. also, in order
to be consistent, all "crawling reader" in variable names are updated
as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The cleanup compaction task is a maintenance operation that runs after
topology changes. So, run it under the maintenance scheduling group to
avoid interference with regular compaction tasks. Also remove the share
allocations done by the cleanup task, as they are unnecessary when
running under the maintenance group.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#20582
Adding a new tenant needs to be done under cluster feature protection.
However it wasn't the case for adding `$maintenance` statement tenant
and to fix it we need to support an upgrade from node which doesn't
know about maintenance tenant at all and from one which uses it without
any cluster feature protection.
This commit adds `enabled` flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create
a connection, but it can be used to accept an incoming connection.
There is `service_level_controller::get_service_level()` method,
which searches for service level in the controller cache and returns
default service level if SL with given name doesn't exist.
Added method allows to check whether a service level exists in the
controller cache.
That's the most mysterious wrapper in this set as it doesn't need
sstable itself at all, it just duplicates the existing non-class
function out there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This one is a bit tricky, as it needs to modify the sstables's summary.
However, the sstables::test::_summary() one returns mutable reference
and the only caller can use it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Same as previous patch -- callers can come with const reference to
summary, so they can live with existing public sstable::get_summary().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just call the public sstable::get_statistics(). The callers would get
const reference on it, but they don't need more than that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The wrapper just changes the order of arguments for a public method.
Drop it, and call the wrapee directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When filtering, we apply single-column and multi-column filters separately.
This is completely unnecessary. Find the multi-column filters during prepare
time and append them to the row-level filter.
This slightly changes the original: in the original, if we had a multi-column
filter, we applied all of the restrictions. But hopefully if we check
for multi-column filters, that's what we need.
The two filters are used in the same way: check the filter, return false if
it matches.
Unify the two filters into a clustering_row_level_filter.
Since one of the two filters wasn't std::optional, we take the liberty
of making the combined filter non-optional.
The two filters are used in the same way: check the filter, set a boolean
flag if it matches, return false. The two boolean flags are in turn checked
in the same way.
Unify the two filters into a partition_level_filter.
Since one of the two filters wasn't std::optional, we take the liberty
of making the combined filter non-optional.
Since 3c7af28725, the cqlsh submodule no longer contains a
bin/cqlsh shell script. This broke the supermodule's bin/cqlsh
shortcut.
Fix it by invoking cqlsh.py directly.
Closesscylladb/scylladb#20591
Cleanup of a deallocated tablet throws an exception.
Since failed cleanup is retried, we end up in an infinite loop.
Ignore cleanup of deallocated storage groups.
Fixes: #19752.
Needs to be backported to all branches with tablets (6.0 and later)
Closesscylladb/scylladb#20584
* github.com:scylladb/scylladb:
test: check if cleanup of deallocated sg is ignored
replica: ignore cleanup of deallocated storage group
To drop a semaphore it should not be held by anyone, so we need to
release out units before checking if a semaphore can be dropped.
Fixes: scylladb/scylladb#20602Closesscylladb/scylladb#20607
as `_bucket` is an `unordered_map<bucket_id, timestamp_bucket_writer>`,
when writing to a given bucket, we try to create a writer with the
specified bucket id, so the returned iterator should point to a node
whose `first` element is always the bucket id.
so, there is no need to reference `it` for the bucket id, let's just
reference the parameter. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20598
Instead of filtering regular and static columns column by column, call
is_satisfied_by() for an expression containing all the static columns
predicates, and one for all the regular column.
We cannot have one expression, since the code sets
_current_static_row_does_not_match only for static columns.
Note the fix for #20485 is now implicit, since the evaluation machinery
will treat missing regular columns as NULL.
Similar to previous work with clustering and partition key, expose
static and reglar column filters as single expressions.
Since we don't currently expose a boolean for whether those filters
exist, we expose them now as non-optionals. In any case evaluating
an empty conjunction is plenty fast.
Instead of filtering the clustering key column by column, call
is_satisfied_by() for an expression containing all the clustering key
predicates.
The check for clustering_key.empty() is removed; the evaluation machinery
is able to handle partial clustering keys. In fact if we add IS NULL,
we have to evaluate as an empty clustering key should match.
cql3::selection performs filtering by consulting
ck_restrictions_need_filtering() and
get_single_column_clustering_key_restrictions() (which is a map of column
definition to expressions). Make them available in one
nice package as an optional<expression>. When the optional is engaged,
filtering is needed, and the expression in the equivalent of all of the
map.
cql3::selection performs filtering by consulting
pk_restrictions_need_filtering() and
get_single_column_partition_key_restrictions() (which is a map of column
definition to expressions). Make them available in one
nice package as an optional<expression>. When the optional is engaged,
filtering is needed, and the expression in the equivalent of all of the
map.
statement_restrictions does not name columns that were used for a secondary
index for selection for filtering, since accessing the index "pre-filters"
these columns.
However, it keeps the relations that contain these columns. This makes
it impossible (besides unnecessary) to evaluate the relations, as the
columns they reference aren't selected.
The reason this works now is that
result_set_builder::restrictions_filter::do_filter() iterates on selected
columns, matching them to relations, then execute the matched relation.
A relation that references an unselected column is invisible to do_filter().
We wish to filter using complete expressions, rather than fragments, so as
a first step remove these unnecessary and unusable relations while we
choose which columns are necessary for filtering.
calculate_column_defs_for_filtering is renamed to remind us of the extra
work done.
The condition seems trivial, but wasn't implemented, without ill effects
so far.
With the following patches, calculate_column_defs_for_filtering() becomes
confused as it selects an indexing code path even when !_uses_secondary_index,
triggered by the reproducer of #10300.
Our UPDATE/INSERT/DELETE statements require a full primary/partition key
and therefore never use indexes; fix the check_index parameter passed from
modification_statement.
So far the bug is benign as we did not take any action on the value.
Make the parameter non-default to avoid such confusion in the future.
The second loop of calculate_column_defs_for_filtering() finds clustering
keys that are used for filtering, minus and clustering keys that happen
to be used for secondary indexing.
However, to check whether the clustering key is used for secondary indexing,
it looks up in _single_column_partition_key_restrictions, which contains
partition key restrictions.
The end result is that we select a column which ends the partition key
for the secondary index, and so is unnecessary. We do a little more work,
but the bug is benign. Nevertheless, fix it, as it interferes with following
work.
get_column_defs_for_filtering() names all the columns that are required
for filtering. While doing that, it skips over columns that are participate
in indexing (primary or secondary), since the index "pre-filters" the
query.
We wish to make use of this skipping. As a first step, call the calculation
from the constructor, so we have control over when it is executed.
Currently, for each column we call get_non_pk_values() to transform
the way we get the information (query::result_row_view) to the way
the expression evaluation machinery wants it (vector<managed_bytes_opt>).
Call it just once outside the loop.
If a regular row isn't present, no regular column restriction
(say, r=3) can pass since all regular columns are presented as NULL,
and we don't have an IS NULL predicate. Yet we just ignore it.
Handle the restriction on a missing column by return false, signifying
the row was filtered out.
We have to move the check after the conditional checking whether there's
any restriction at all, otherwise we exit early with a false failure.
Unit test marked xfail on this issue are now unmarked.
A subtest of test_tombstone_limit is adjusted since it depended on this
bug. It tested a regular column which wasn't there, and this bug caused
the filter to be ignored. Change to test a static column that is there.
A test for a bug found while developing the patch is also added. It is
also tested by test_tombstone_limit, but better to have a dedicated test.
Fixes#10357Closesscylladb/scylladb#20486
This option was silently broken when --enable-tablet's default changed
from false to true. The reason is that when --vnodes is passed, run only
removes --enable-tablets=true from scylla's command line. With the new
default this is not enough, we need to explicitely disable tablets to
override the default.
Closesscylladb/scylladb#20462
Like it was done for table::init_storage(), patch the
table::destroy_storage() not to mess with datadir path and rely on
storage options only.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A cleanup after prefious patch -- in order to create storage options for
table the local initialization code can re-use the vector of paths that
it hag generated in the same call to create table directory layout.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The init_table_storage() method now does it twice -- one time to
initialize the storage, another one to create new options for table.
Both can be merged, thus making table storage options initialization
better encapsulated for local/s3 cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the table::init_storage() calls sstables manager two times -- first,
to get storage options, second, to initialize the storage with obtained
options. Merge two calls into one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The table::init_table_storage() calls sstables_manager's storage
initialization for each of the datadirs found on config. That's not
great, it's sstables manager (and its storage) that know if table needs
to mess with datadirs or not. This patch moves the loop to storage.cc.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main goal is to have init_table_storage() overload for local options
as standalone function. This makes next patching simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of the previous patch patching the special-purpose sstable
directory constructor that's used by restore-from-s3-backup code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's no longer needed -- both, lister and making sstable, work with
having storage options at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This class is very similar to sstables::storage one -- it also needs
path or s3 prefix to construct. Now when this information is stored on
storage_options, it's better to stick to it, not to the argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It used to be passed to storage constructor, now storage works with
options only and this argument is no longer needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All callers of make_sstable are now patched to provide correct storage
options with path/prefix set. The make_storage() helper can switch to
using it. Respectively, it's good to make sure that the storage is
created with table options that have path/prefix.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the tests work with local storage options. Some support S3
options as well. Whatever it is, when creating an sstable, tests need to
put proper "dir" on the options, this patch does so.
In fact, storage options for tests are created together with the
test-env, and ideally this is the place where dir should be assigned on
it. However, there are still places that explicitly specify path they
want to see sstables at, for those the new temporary options should be
constructed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Restore-from-backup code wants to collect sstables from remote S3. For
that it constructs S3 options, and now it needs to put prefix on it as
well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Lost of code (in tools and tests) explicitly deal with local sstables
and need to create options for it. Currently default-constructing
options generates local ones, but without the directory path. Add a
helper that creates local options with path and patch callers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when tables keep their own copy of storage options, it's possible
for each table to add table-specific information on it. Namely -- path
for local storage and prefix for S3 one (in fact, it's not a "prefix",
but a key in sstables registry, but fixing it is beyond the scope of
this set).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the table's storage_options life starts in cql, and shortly
after the lw-shared-pointer to options is put on keyspace metadata.
Later, when the table is created the pointer from keyspace is copied on
the table via its contructor.
Next patches will extend the options pointed to by a table, and the
extension is going to be different for different tables. For that, each
table needs to have its private options and this patch prepares for
that.
For now table directly calls sstables/storage code to get the options
from, but it's temporary, soon the options will be created via sstables
manager together with initialising the storage itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is a translation of Cassandra's CQL unit test source file
OperationFctsTest.java into our cql-pytest framework.
This is a massive test suite (over 800 lines of code) for Cassandra's
"arithmetic operators" CQL feature (CASSANDRA-11935), which was added
to Cassandra almost 8 years ago (and reached Cassandra 4.0), but we
never implemented it in Scylla.
All of the tests in suite fail in ScyllaDB due to our lack of this
feature:
Refs #2693: Support arithmetic operators
One test also discovered a new issue:
Refs #20501: timestamp column doesn't allow "UTC" in string format
All the tests pass on Cassandra.
Some of the tests insist on specific error message strings and specific
precision for decimal arithmetic operations - where we may not necessarily
want to be 100% compatible with Cassandra in our eventual implementation.
But at least the test will allow us to make deliberate - and not accidental -
deviations from compatibility with Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20502
Currently, attempt to cleanup deallocated storage group throws
an exception. Failed tablet cleanup is retried, stucking
in an endless loop.
Ignore cleanup of deallocated storage group.
The `consume*()` variants just forward the call to the `_impl` method
with the same name. The latter, being a member of `::impl`, will bypass
the top level `fill_buffer()`, etc. methods and thus will never call
`set_close_required()`. Do this in the top-level `consume*()` methods
instead, to ensure a reader, on which only `consume*()` is called, and
then is destroyed, will complain as it should (and abort).
operator()() was also missing `set_close_required()`, fix that too.
Currently, test.py will throw an error if the test will use
BOOST_DATA_TEST_CASE. test.py as a first step getting all test
functions in the file, but when BOOST_DATA_TEST_CASE will be used the
output will have additional lines indicating parametrized test that
test.py can not handle. This commit adds handling this case, as a caveat
all tests should start from 'test' or they will be ignored.
Closes: #20530Closesscylladb/scylladb#20556
This function was obsoleted by schema_builder some time ago. Not to patch all its callers, that helper became wrapper around it. Remained users are all in tests, and patching the to use builder directory makes the code shorter in many cases.
Closesscylladb/scylladb#20466
* github.com:scylladb/scylladb:
schema: Ditch make_shared_schema() helper
test: Tune up indentation in uncompressed_schema()
test: Make tests use schema_builder instead of make_shared_schema
Since #14152 creation of an sstable takes table dir and its state. The
test in question wants to create and sstable in upload/ subdir and for
that it used to maintain full "cf.dir/upload" path, which is not
required any more.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20514
There are two currently -- upload_sink_base and do_upload_file. This PR merges as much code as possible (spoiler: it's already mostly copy-n-pase-d, so squashing is pretty straightforward)
Closesscylladb/scylladb#20568
* github.com:scylladb/scylladb:
s3/client: Reuse class multipart_upload in do_upload_file
s3/client: Split upload_sink_base class into two
the `seastar/core/print.hh` header is no longer required by
`auth/resource.hh`. this was identified by clang-include-cleaner.
As the code is audited, wecan safely remove the #include directive.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20575
When purging regular tombstone consult the min_live_timestamp, if available.
This is safe since we don't need to protect dead data from resurrection, as it is already dead.
For shadowable_tombstones, consult the min_memtable_live_row_marker_timestamp,
if available, otherwise fallback to the min_live_timestamp.
If we see in a view table a shadowable tombstone with time T, then in any row where the row marker's timestamp is higher than T the shadowable tombstone is completely ignored and it doesn't hide any data in any column, so the shadowable tombstone can be safely purged without any effect or risk resurrecting any deleted data.
In other words, rows which might cause problems for purging a shadowable tombstone with time T are rows with row markers older or equal T. So to know if a whole sstable can cause problems for shadowable tombstone of time T, we need to check if the sstable's oldest row marker (and not oldest column) is older or equal T. And the same check applies similarly to the memtable.
If both extended timestamp statistics are missing, fallback to the legacy (and inaccurate) min_timestamp.
Fixesscylladb/scylladb#20423Fixesscylladb/scylladb#20424
> [!NOTE]
> no backport needed at this time
> We may consider backport later on after given some soak time in master/enterprise
> since we do see tombstone accumulation in the field under some materialized views workloads
Closesscylladb/scylladb#20446
* github.com:scylladb/scylladb:
cql-pytest: add test_compaction_tombstone_gc
sstable_compaction_test: add mv_tombstone_purge_test
sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection
sstable_compaction_test: tombstone_purge_test: add testlog debugging
sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp
sstable, compaction: add debug logging for extended min timestamp stats
compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats
compaction: define max_purgeable_fn
tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh
sstables: scylla_metadata: add ext_timestamp_stats
compaction_group, storage_group, table_state: add extended timestamp stats getters
sstables, memtable: track live timestamps
memtable_encoding_stats_collector: update row_marker: do nothing if missing
Since JMX server is deprecated, drop them from submodule, build system
and package definition.
Related scylladb/scylla-tools-java#370
Related #14856
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closesscylladb/scylladb#17969
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump
because the service only have write acess with systemd_coredump_var_lib_t.
To make it writable, we need to add file context rule for
/var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.
Fixes#20573
Seems like specific version of systemd pacakge on RHEL9 has a bug on
SELinux configuration, it introduced "systemd-container-coredump" module
to provide rule for systemd-coredump, but not enabled by default.
We have to manually load it, otherwise it causes permission error.
Fixes#19325
Uploading a file is implemented by the do_upload_file class. This class
re-implements a big portion of what's currently in multipart_upload one.
This patch makes the former class inherit from the latter and removes
all the duplication from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This class implements two facilities -- multipart upload protocol itself
plus some common parts of upload_sink_impl (in fact -- only close() and
plugs put(packet)).
This patch aplits those two facilities into two classes. One of them
will be re-used later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the server_impl::applier_fiber is paused by a co_await at line raft/server.cc:1375:
```
co_await override_snapshot_thresholds();
```
a new snapshot may be applied, which updates the actual values of the log's last applied
and snapshot indexes. As a result, the new snapshot index could become higher than the
old value stored in _applied_idx at line raft/server.cc:1365, leading to an assertion
failure in log::last_conf_for().
Since error injection is disabled in release builds, this issue does not affect production releases.
This issue was introduced in the following commit
9dfa041fe1,
when error injection was added to override the log snapshot configuration parameters.
How to reproduce:
1. Build debug version of randomized_nemesis_test
```
ninja-build build/debug/test/raft/randomized_nemesis_test
```
2. Run
```
parallel --halt now,fail=1 -j20 'build/debug/test/raft/randomized_nemesis_test \
--run_test=test_frequent_snapshotting -- -c2 -m2G --overprovisioned --unsafe-bypass-fsync 1 \
--kernel-page-cache 1 --blocked-reactor-notify-ms 2000000 --default-log-level \
trace > tmp/logs/eraseme_{}.log 2>&1 && rm tmp/logs/eraseme_{}.log' ::: {1..1000}
```
Fixesscylladb/scylladb#20363Closesscylladb/scylladb#20555
This PR hides the ToC on the Alternator page, as we don't need it, especially at the end of the page.
The ToC must be hidden rather than removed because removing it would, in turn, remove the "Getting Started With ScyllaDB Alternator" and "ScyllaDB Alternator for DynamoDB users" from the page tree and make them inaccessible.
In addition, this PR moves Alternator higher in the page tree.
Fixes https://github.com/scylladb/scylladb/issues/19823Closesscylladb/scylladb#20565
* github.com:scylladb/scylladb:
doc: move Alternator higher in the page tree
doc: hide the redundant ToC on the Alternator page
A CreateTable request defines the KeySchema of the base table and each
of its GSIs and LSIs. It also needs to give an AttributeDefinition for
each attribute used in a KeySchema - which among other things specifies
this attribute's type (e.g., S, N, etc.). Other, non-key, attributes *do
not* have a specified type, and accordingly must not be mentioned in
AttributeDefinitions.
Before this patch, Alternator just ignored unused AttributeDefinitions
entries, whereas DynamoDB throws an error in this case. This patch fixes
Alternator's behavior to match DynamoDB's - and adds a test to verify this.
Besides being more error-path-compatible with DynamoDB, this extra check
can also help users: We already had one user complaining that an
AttributeDefinitions setting he was using was ignored, not realizing
that it wasn't used by any KeySchema. A clear error message would have
saved this user hours of investigation.
Fixes#19784.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20378
The part of the document which explains diagnostics dumps was due for an
update. It was missing an explanation on the dumped stats and it
also needs to explain the "Problematic permit" and "Identified
bottleneck(s)".
Adjust the test reader_concurrency_semaphore_dump_reader_diganostics to
also cover the new diagnostics functionality. The test is not a
correctness test -- the output has to be inspected by a human. But it is
good enough to make sure the code paths do not have any memory errors.
There are a few typical cases of bottlenecks, which can be easily
identified when dumping the semaphore diagnostics. Identify and print
these to fast-track investigations.
In the previous patch, we provided an opportunity for callers to provide
a trigger permit, when calling `maybe_dump_reader_permit_diagnostics()`.
If the caller provided the trigger permit, include its details in the
dump, allowing the identification of the table and code-path of the
permit which triggered the dump.
recently, we are observing errors like:
```
stderr: error running operation: rjson::error (JSON SCYLLA_ASSERT failed on condition 'false', at: 0x60d6c8e 0x4d853fd 0x50d3ac8 0x518f5cd 0x51c4a4b 0x5fad446)
```
we only passed `false` to the `RAPIDJSON_ASSERT()` macro, so what we
have is but the type of the error (rjson::error) and a backtrace.
would be better if we can have more information without recompiling
or fetching the debug symbols for decipher the backtrace.
Refs scylladb/scylladb#20533
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20539
This commit hides the ToC, as we don't need it, especially at the end of the page.
The ToC must be hidden rather than removed because removing it would, in turn,
remove the "Getting Started With ScyllaDB Alternator" and "ScyllaDB Alternator for DynamoDB users"
from the page tree and make them inaccessible.
It is currently unused in `process_on_shard`, which
generates an empty service_permit.
The next patch may call process_on_shard in a loop,
so it can't simply move the permit to the callee
and better hold on to it until processing completes.
`cql_server::connection::process` was turned into
a coroutine in this patch to hold on to the permit parameter
in a simple way. This is a preliminary step to changing
`if (bounce_msg)` to `while (bounce_msg)` that will allow
rebouncing the message in case it moved yet again when
yielding in `process_on_shard`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So it can safely passed between shards, as will be needed
in the following patch that handles a (re)bounce_to_shard result
from process_fn that's called by `process_on_shard` on the
`move_to_shard`.
With that in mind, pass the `bounce_to_shard` payload
to `process_on_shard` rather than the foreign shared ptr
since the latter grabs what it needs from it on entry
and the shared_ptr can be released on the calling shard.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Quoting Avi Kivity:
> Out of scope: we should consider detemplating this.
As a follow-up we should consider that and pass
a function object as process_fn, just make sure
there are no drawbacks.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
add test for issue when writes in commitlog segments pinned to another table can be resurrected.
This test based on dtest code published in #14870 and adapted for community version.
It's a regression test for #15060 fix and should fail before this patch and succeed afterwards.
Refs #14870, #15060Closesscylladb/scylladb#20331
When we introduced optimized clang at 6e487a4, we dropped multiarch build on frozen toolchain, because building clang on QEMU emulation is too heavy.
Actually, even after the patch merged, there are two mode which does not build clang, --clang-build-mode INSTALL_FROM and --clang-build-mode SKIP.
So we should restore multiarch build only these mode, and keep skipping on INSTALL mode since it builds clang.
Since we apply multiarch on INSTALL_FROM mode, --clang-archive replaced
to --clang-archive-x86_64 and --clang-archive-aarch64.
Note that this breaks compatibility of existing clang archive, since it
changes clang root directory name from llvm-project to llvm-project-$ARCH.
Closes#20442Closesscylladb/scylladb#20444
When a read times out, we use different exception types for the permit's
future (if the permit is waiting), or the permit's abort exception _ex
(which is used to abort ongoing reads). This patch changes both to use
named_semaphore_timed_out, which is the more verbose of the two.
Currently the semaphore only dumps diagnostics when a waiting reader
times out. The diagnostics are also useful when a non-waiting reader
(which is in the process of reading) times out, so also dump diagnostics
in this case.
Change the code to use a switch statement, so future addition of states
don't miss updating this logic.
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.
that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:
```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
265 | return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
| ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
4290 | format(format_string<_Args...> __fmt, _Args&&... __args)
| ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
143 | format(fmt::format_string<A...> fmt, A&&... a) {
| ^
```
in this change, we
change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
`seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
because, `sstring::operator std::basic_string` would incur a deep
copy.
we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This PR introduces a new file data source implementation for uncompressed SSTables that will be validating the checksum of each chunk that is being read. Unlike for compressed SSTables, checksum validation for uncompressed SSTables will be active for scrub/validate reads but not for normal user reads to ensure we will not have any performance regression.
It consists of:
* A new file data source for uncompressed SSTables.
* Integration of checksums into SSTable's shareable components. The validation code loads the component on demand and manages its lifecycle with shared pointers.
* A new `integrity_check` flag to enable the new file data source for uncompressed SSTables. The flag is currently enabled only through the validation path, i.e., it does not affect normal user reads.
* New scrub tests for both compressed and uncompressed SSTables, as well as improvements in the existing ones.
* A change in JSON response of `scylla validate-checksums` to report if an uncompressed SSTable cannot be validated due to lack of checksums (no `CRC.db` in `TOC.txt`).
Refs #19058.
New feature, no backport is needed.
Closesscylladb/scylladb#20207
* github.com:scylladb/scylladb:
test: Add test to validate SSTables with no checksums
tools: Fix typo in help message of scylla validate-checksums
sstables: Allow validate_checksums() to report missing checksums
test: Add test for concurrent scrub/validate operations
test: Add scrub/validate tests for uncompressed SSTables
test/lib: Add option to create uncompressed random schemas
test: Add test for scrub/validate with file-level corruption
test: Check validation errors in scrub tests
sstables: Enable checksum validation for uncompressed SSTables
sstables: Expose integrity option via crawling mutation readers
sstables: Expose integrity option via data_consume_rows()
sstables: Add option for integrity check in data streams
sstables: Remove unused variable
sstables: Add checksum in the SSTable components
sstables: Introduce checksummed file data source implementation
sstables: Replace assert with on_internal_error
Fixes#20543
In cql_test_env, if cfg_in.ms_listen is set, we try to get a free port for the current test on
which message service rpc can bind. This to allow multiple tests in parallel.
However, we just do this by using random and getting a number, not actually verifying it against
host ports in use.
This is complicated further by the fact that port reuse is effectively disabled in seastar
(see reactor::posix_reuseport_detect()). Due to this, the solution applied here is a combo
of
* Create temp socket with port = 0 to get a previously free port
* Close socket right before listen (to handle reuse not working)
* Retry on EADDRINUSE
Closesscylladb/scylladb#20547
In 880058073b a new column (request_type)
was added to topology_requests table, but the table's schema version
wasn't changed. Due to that during cluster upgrade, the old and the new
versions occur but they are not distinguishable.
Add offset to schema version of topology_requests table if it contains
request_type column.
Fixes: #20299.
Closesscylladb/scylladb#20402
Migrate the `system_distributed.view_build_status` table to `system.view_build_status_v2`. The writes to the v2 table are done via raft group0 operations.
The new parameter `view_builder_version` stored in `scylla_local` indicates whether nodes should use the old or the new table.
New clusters use v2. Otherwise, the migration to v2 is initiated by the topology coordinator when the feature is enabled. It reads all the rows from the old table and writes them to the new table, and sets `view_builder_version` to v2. When the change is applied, all view_builder services are updated to write and read from the v2 table.
The old table `system_distributed.view_build_status` is set to read virtually from the new table in order to maintain compatibility.
When removing a node from the cluster, we remove its rows from the table atomically (fixes https://github.com/scylladb/scylladb/issues/11836). Also, during the migration, we remove all invalid rows.
Fixesscylladb/scylladb#15329
dtest https://github.com/scylladb/scylla-dtest/pull/4827Closesscylladb/scylladb#19745
* github.com:scylladb/scylladb:
view: test view_build_status table with node replace
test/pylib: use view_build_status_v2 table in wait_for_view
view_builder: common write view_build_status function
view_builder: improve migration to v2 with intermediate phase
view: delete node rows from view_build_status on node removal
view: sanitize view_build_status during migration
view: make old view_build_status table a virtual table
replica: move streaming_reader_lifecycle_policy to header file
view_builder: test view_build_status_v2
storage_service: add view_build_status to raft snapshot
view_builder: migration to v2
db:system_keyspace: add view_builder_version to scylla_local
view_builder: read view status from v2 table
view_builder: introduce writing status mutations via raft
view_builder: pass group0_client and qp to view_builder
view_builder: extract sys_dist status operations to functions
db:system_keyspace: add view_build_status_v2 table
In a previous patch we extended the return status of
`sstables::validate_checksums()` to report if an SSTable cannot be
validated due to a missing CRC component (i.e., CRC.db does not appear
in TOC.txt).
Add a test case for this.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Change the return type of `sstable::validate_checksums()` from binary
(valid/invalid) to a ternary (valid/invalid/no_checksums). The third
status represents uncompressed SSTables without a CRC component (no
entry for CRC.db in the TOC).
Also, change the JSON response of `sstable validate-checksums` to expose
the new status. Replace the boolean value for valid/invalid checksums
with an object that contains two boolean keys: one that indicates if the
SSTable has checksums, and one that indicates if the checksums are valid
or not. The second key is optional and appears only if the SSTable has
checksums.
Finally, update the documentation to reflect the changes in the API.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Theoretically it is possible to launch more than one scrub instances
simultaneously. Since the checksum component is a shared resource,
accesses have to be synchronized.
Add a test that launches two scrub operations in validate mode and
ensures that the checksum component is loaded once, referenced by all
scrub instances via shared pointers, and deleted once the scrub
operations finish. Introduce an injection point to achieve concurrent
execution of scrubs.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Currently the unit tests check scrub in validate mode against compressed
SSTables only. Mirror the tests for uncompressed SSTables as well.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Extend the `random_schema_specification` to support creating both
compressed and uncompressed schemas.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Currently, we test scrub/validate only against a corrupted SSTable with
content-level corruption (out-of-order partition key).
Add a test for file-level corruption as well. This should trigger the
checksum check in the underlying compressed file data source
implementation.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Scrub was extended in PR #11074 to report validation errors but the
unit tests were not updated.
Update the tests to check the validation errors reported by scrub.
Validation errors must be zero for valid SSTables and non-zero for
invalid SSTables.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Extend the `sstable::validate()` to validate the checksums of
uncompressed SSTables. Given that this is already supported for
compressed SSTables, this allows us to provide consistent behavior
across any type of SSTable, be it either compressed or uncompressed.
The most prominent use case for this is scrub/validate, which is now
able to detect file-level corruption in uncompressed SSTables as
well.
Note that this change will not affect normal user reads which skip
checksum validation altogether.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add a new boolean parameter in `sstable::data_stream()` to
enable/disable integrity mechanisms in the underlying data streams.
Currently, this only affects uncompressed SSTables and it allows to
enable/disable checksum validation on each chunk. The validation happens
transparently via the checksummed data source implementation.
The reason we need this option is to allow differentiating the behavior
between normal user reads and scrub/validate reads. We would like to
enable scrub to verify checksums for uncompressed SSTables, while
leaving normal user reads unchanged for performance reasons (read
amplification due to round up of reads to chunk size and loading of the
CRC component).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Remove unused stream variable from `sstable::data_stream()`. This was
introduced in commit 47e07b787e but never used.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Uncompressed SSTables store their checksums in a separate CRC.db file.
Add this in the list of SSTable components.
Since this component is used only for validation, load the component
on-demand for validation tasks and delete it when all validation tasks
finish. In more detail:
- Make the checksum component shareable and weakly referencable.
Also, add a constructor since it is no longer an aggregate.
- Use a weak pointer to store a non-owning reference in the components
and a shared pointer to keep the object alive while validation runs.
Once validation finishes, the component should be cleaned up
automatically.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Introduce a new data source implementation for uncompressed SSTables.
This is just a thin wrapper for a raw data source that also performs
checksum validation for each chunk. This way we can have consistent
behavior for compressed and uncompressed SSTables.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.
Fixes#20301
Requires a backport to 6.0 and 6.1 as they have the same issue.
Closesscylladb/scylladb#20471
* github.com:scylladb/scylladb:
cql-pytest: add test to verify compaction_flush_all_tables_before_major_seconds config
database::get_all_tables_flushed_at: fix return value
This patch change the alternator documentation to express that the
provisoned units are stored and return but Alternator ignores them.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The test_provisioned_throughput.py test ProvisionedThroughput support.
The first test, check that ProvisionedThroughput can be set and get when
using describe table.
The second test check that missing read or write will throw an
exception.
The third test check that when using billing PAY_PER_REQUEST it returns
zero for the read and write units.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds the ability to override the BillingMode. If a
BillingMode is provided to the create_test_table function, it will
override the default BillingMode.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds the ability to store and retrieve the
ProvisionedThroughput in a table.
The information is stored in the table tags. We use the TTL convention
used in alternator, and the tags will be: system:provisioned_rcu and
system:provisioned_wcu.
verify_billing_mode function now return a struct with the billing mode
information.
The code of describe_table now check if the provision tags exists and
return the RCU and WCU accordingly.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Test tombstone garbage collection with:
1. conflicting live data in memtable (verifying there is no regression
in this area)
2. deletion in memtable (reproducing scylladb/scylladb#20423)
3. materialized view update in memtable (reproducing scylladb/scylladb#20424)
in materialized_views
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than forging a timestamp from the gc_clock
just use `next_timestamp` do it can be considered
for tomebstone purging purposes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When purging regular tombstone consult the min_live_timestamp,
if available.
For shadowable_tombstones, consult the
min_memtable_live_row_marker_timestamp,
if available, otherwise fallback to the min_live_timestamp.
If both are missing, fallback to the legacy
(and inaccurate) min_timestamp.
Fixesscylladb/scylladb#20423Fixesscylladb/scylladb#20424
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before we add a new, is_shadowable, parameter to it.
And define global `can_always_purge` and `can_never_purge`
functions, a-la `always_gc` and `never_gc`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And define `never_gc` globally, same as `always_gc`
Before adding a new, is_shadowable parameter to it.
Since it is used in the context of compaction
it better fits compaction_garbage_collector header
rather than tombstone.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Store and retrieve the optional extended timestamp statistics
(min_live_timestamp and min_live_row_marker_timestamp)
in the scylla_metadata component.
Note that there is no need for a cluster feature to
store those attributes since the scylla_metadata
on-disk format is extensible so that old sstables
can be read by new versions, seeing the extra stats
is missing, and new sstables can be read by old
versions that ignore unknown scylla metadata section types.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To return the minimum live timestamp and live row-marker
timestamp across a compaction_group, storage_group, or
table_state.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When garbage collecting tombstones, we care only
about shadowing of live data. However, currently
we track min/max timestamp of both live and dead
data, but there is no problem with purging tombstones
that shadow dead data (expired or shdowed by other
tombstones in the sstable/memtable).
Also, for shadowable tombstones, we track live row marker timestamps
separately since, if the live row marker timestamp is greater than
a shadowable tombstone timestamp, then the row marker
would shadow the shadowable tombstone thus exposing the cells
in that row, even if their timestasmp may be smaller
than the shadow tombstone's.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refs #18161
Yet another approach to dealing with large commitlog submissions.
We handle oversize single mutation by adding yet another entry
typo: fragmented. In this case we only add a fragment (aha) of
the data that needs storing into each entry, along with metadata
to correlate and reconstruct the full entry on replay.
Because these fragmented entries are spread over N segments, we
also need to add references from the first segment in a chain
to the subsequent ones. These are released once we clear the
relevant cf_id count in the base.
*
This approach has the downside that due to how serialization etc
works w.r.t. mutations, we need to create an intermediate buffer
to hold the full serialized target entry. This is then incrementally
written into entries of < max_mutation_size, successively requesting
more segments.
On replay, when encountering a fragment chain, the fragment is
added to a "state", i.e. a mapping of currently processing
frag chains. Once we've found all fragments and concatenated
the buffers into a single fragmented one, we can issue a
replay callback as usual.
Note that a replay caller will need to create and provide such
a state object. Old signature replay function remains for tests
and such.
This approach bumps the file format (docs to come).
To ensure "atomicity" we both force synchronization, and should
the whole op fail, we restore segment state (rewinding), thus
discarding data all we wrote.
Closesscylladb/scylladb#19472
* github.com:scylladb/scylladb:
commitlog/database: Make some commitlog options updatable + add feature listener
features/config: Add feature for fragmented commitlog entries
docs: Add entry on commitlog file format v4
commitlog_test: Add more oversized cases
commitlog_replayer: Replay segments in order created
commitlog_replayer: Use replay state to support fragmented entries
commitlog_replayer: coroutinize partly
commitlog: Handle oversized entries
If the row_marker is missing then its timestamp
is missing as well, so there's no point
calling update_timestamp for it. Better return early.
This should cause no functional change.
The following patch will add more logic
for tracking extended timestamp stats.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently storage service is drained while group0 is still active. The
draining stops commitlogs, so after this point no more writes are
possible, but if group0 is still active it may try to apply commands
which will try to do writes and they will fail causing group0 state
machine errors. This is benign since we are shutting down anyway, but
better to fix shutdown order to keep logs clean.
Fixesscylladb/scylladb#19665
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.
Fixes#20301
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
If you are using dbuild, that's where test.py needs to run.
Also, replace 'Docker image' with the more generic 'container' term.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#20336
ScyllaDB doesn't support custom compressors. The available compressors
are the only available ones, not the default ones.
Adjust the text to reflect this.
Closesscylladb/scylladb#20225
The test-env in question is mostly started in one-shard mode. Also there
are several boost tests that start sharded<> environment. In that case
instances on different shards live in different temp dirs. That's not
critical yet, but better to have single directory for the whole test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20412
In test.py every asyncio task spawned during the test must be finished before the next test, otherwise, tests might affect each other results.
The developers are responsible for writing asyncio code in a way that doesn’t leave task objects unfinished.
Test.py has a mechanism that helps test writers avoid such tasks. At the end of each test case, it verifies that the test did not produce/leave any tasks and sets an event object that fails the next test at the start if this is the case(issue https://github.com/scylladb/scylladb/issues/16472)
The problem with this was that breaking the next test was counterintuitive, and the logging for this situation was insufficient and unobvious.
notes: Task.cancel() is not an option to avoid task leakage
1) Calling cancel() Does Not Cancel The Task : the cancel() method just request that the target task cancel.
2) Calling cancel() Does Not Block Until The Task is Cancelled: If the caller needs to know the task is cancelled and done, it could await for the target
3) In particular PR, task.cancel() cancell task on client(ManagerClient) but not on http server(ScyllaManager). so "await" is needed.
Closesscylladb/scylladb#20012
The test in question generates a bunch of table_for_tests objects and
creates sstables for each. For that it calls test_env::make_sstable(),
but it can be made shorter, by calling table method directly.
The hidden goal of this change is to remove the explicit caller of
table::dir() method. The latter is going away.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20451
This includes
- coroutinization
- elimination of unused overload
Closesscylladb/scylladb#20456
* github.com:scylladb/scylladb:
test: Squash two open_sstables() helper together
test: Coroutinize open_sstables() helper
This series prepares us for working on #11567 - allow adding a GSI to a pre-existing table. This will require changing the implementation of GSIs in Alternator to not use real columns in the schema for the materialized view, and instead of a computed column - a function which extracts the desired member from the `:attrs` map and de-serializes it.
This series does not contain the GSI re-implementation itself. Rather it contains a few small cleanups and mostly - new regression tests that cover this area, of adding and removing a GSI, and **using** a GSI, in more details than the tests we already had. I developed most of these tests while working on **buggy** fixes for #11567; The bugs in those implementations were exposed by the tests added here - they exposed bugs both in the new feature of adding or removing a GSI, and also regressions to the ordinary operation of GSI. So these tests should be helpful for whoever ends up fixing #11567, be it me based on my buggy implementation (which is _not_ included in this patch series), or someone else.
No backports needed - this is part of a new feature, which we don't usually backport.
Closesscylladb/scylladb#20383
* github.com:scylladb/scylladb:
test/alternator: more extensive tests for GSI with two new key attributes
test/alternator: test invalid key types for GSI
test/alternator: test combination of LSI and GSI
test/alternator: expand another test to use different write operations
test/alternator: test GSIs with different key types
alternator: better error message in some cases of key type mismatch
test/alternator: test for more elaborate GSI updates
test/alternator: strengthen tests for empty attribute values
test/alternator: fix typo in test_batch.py
test/alternator: more checks for GSI-key attribute validation
Alternator: drop unneeded "IS NOT NULL" clauses in MV of GSI/LSI
test/alternator: add more checks for adding/deleting a GSI
test/alternator: ensure table deletions in test_gsi.py
There are some test cases in sstable_directory_test test actually create
a table with CQL and then try to manipulate its sstables with the help
of sstable_directory. Those tests use existing local helper that starts
sharded<sstable_directory> and this helper passes test-local static
schema to sstable_directory constructor. As a result -- the schema of a
table that test case created and the schema that sstable_directory works
with are different. They match in the columns layout, which helps the
test cases pass, but otherwise are two different schema objects with
different IDs. It's more correct to use table schema for those runs.
The fix introduces another helper to start sharded<sstable_directory>,
and the older wrapper around cql_test_env becomes unused. Drop it too
not to encourage future tests use it and re-introduce schema mismatch
again.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20499
To be able to atomically delete sstables both in
base table directory and in its sub-directories,
like `staging/`, use a shared pending_delete_dir
under under the base directory.
Note that this requires loading and processing
the base directory first.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, there are leftover log messages using
sstlog rather than dirlog, that was introduced
in aebd965f0e,
and that makes debugging harder.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When split, scrub, and upgrade compactions ran under the compaction
group, they had to bump up their shares to a minimum of 200 to prevent
slow progress as they neared completion, especially in workloads with
inconsistent ingestion rates. Since commit e86965c2 moved these
compactions to the maintenance group, this share bump is no longer
necessary. This patch removes the unnecessary share allocation.
Fixes#20224
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#20495
Any expired tombstone can be garbage collected if it doesn't shadow data in the commit log, memtable, or uncompacting SSTables.
This PR introduces a new mode to major compaction, enabled by the `consider_only_existing_data` flag that bypasses these checks. When enabled, memtables and old commitlog segments are cleared with a system-wide flush and all the sstables (after flush) are included in the compaction, so that it works with all data generated up to a given time point.
This new mode works with the assumption that newly written data will not be shadowed by expired tombstones. So it ignores new sstables (and new data written to memtable) created after compaction started. Since there was a system wide flush, commitlog checks can also be skipped when garbage collecting tombstones. Introducing data shadowed by a tombstone during compaction can lead to undefined behavior, even without this PR, as the tombstone may or may not have already been garbage collected.
Fixes#19728Closesscylladb/scylladb#20031
* github.com:scylladb/scylladb:
cql-pytest: add test to verify consider_only_existing_data compaction option
tools/scylla-nodetool: add consider-only-existing-data option to compact command
api: compaction: add `consider_only_existing_data` option
compaction: consider gc_check_only_compacting_sstables when deducing max purgeable timestamp
compaction: do not check commitlog if gc_check_only_compacting_sstables is enabled
tombstone_gc_state: introduce with_commitlog_check_disabled()
compaction: introduce new option to check only compacting sstables for gc
compaction: rename maybe_flush_all_tables to maybe_flush_commitlog
compaction: maybe_flush_all_tables: add new force_flush param
Currently, when attempting to send a hint, we might choose its recipients in one of two ways:
- If the original destination is a natural endpoint of the hint, we only send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.
There is a problem when we decommission a node: while data is streamed away from that node, it is still considered to be a natural endpoint of the data that it used to own. Because of that, it might happen that a hint is sent directly to it but streaming will miss it, effectively resulting in the hint being discarded.
As sending the hint _only_ to the leaving replica is a rather bad idea, send the hint to all replicas also in the case when the original destination of the hint is leaving.
Note that this is a conservative fix written only with the decommission + vnode-based keyspaces combo in mind. In general, such "data loss" can occur in other situations where the replica set is changing and we go through a streaming phase, i.e. other topology operations in case of vnodes and tablet load balancing. However, the consistency guarantees of hinted handoff in the face of topology changes are not defined and it is not clear what they should be, if there should be any at all. The picture is further complicated by the fact that hints are used by materialized views, and sending view updates to more replicas than necessary can introduce inconsistencies in the form of "ghost rows". This fix was developed in response to a failing test which checked the hint replay + decommission scenario, and it makes it work again.
Fixesscylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835
Should be backported to 6.0 and 6.1; the dtest started failing due to topology on raft, which sped up execution of the test and exposed the preexisting problem.
Closesscylladb/scylladb#20488
* github.com:scylladb/scylladb:
test: topology_custom/test_hints: consistency test for decommission
test: topology_custom/test_hints: move sync point helpers to top level
test: topology/util: extract find_server_by_host_id
hints: send hints with CL=ALL if target is leaving
hints: inline do_send_one_mutation
~~~
What we have today in "docs/dev/docker-hub.md" on "aio-max-nr" dates back
to scylla commit f4412029f4 ("docs/docker-hub.md: add quickstart section
with --smp 1", 2020-09-22). Problems with the current language:
- The "65K" claim as default value on non-production systems is wrong;
"fs/aio.c" in Linux initializes "aio_max_nr" to 0x10000, which is 64K.
- The section in question uses equal signs (=) incorrectly. The intent was
probably to say "which means the same as", but that's not what equality
means.
- In the same section, the relational operator "<" is bogus. The available
AIO count must be at least as high (>=) as the requested AIO count.
- Clearer names should be used;
adjust_max_networking_aio_io_control_blocks() in "src/core/reactor.cc"
sets a great example:
- "reactor::max_aio" should be called "storage_iocbs",
- "detect_aio_poll" should be called "preempt_iocbs",
- "reactor_backend_aio::max_polls" should be called "network_iocbs".
- The specific value 10000 for the last one ("network_iocbs") is not
correct in scylla's context. It is correct as the Seastar default, but
scylla has used 50000 since commit 2cfc517874 ("main, test: adjust
number of networking iocbs", 2021-07-18).
Rewrite the section to address these problems.
See also:
- https://github.com/scylladb/scylladb/issues/5981
- https://github.com/scylladb/seastar/pull/2396
- https://github.com/scylladb/scylladb/pull/19921
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
~~~
No need for backporting; the documentation being refreshed targets developers as audience, not end-users.
Closesscylladb/scylladb#20398
* github.com:scylladb/scylladb:
docs/dev/docker-hub.md: refresh aio-max-nr calculation
docs/dev/docker-hub.md: strip trailing whitespace
The yielding lister is considered to be better replacement that scan_dir(lambda) one.
Also, the sstable directory will be patched to scan the contents of S3 bucket and yielding lister fits better for generalization.
Closesscylladb/scylladb#20114
* github.com:scylladb/scylladb:
sstable_directory: Fix indentation after previous patches
sstable_directory: Use yielding lister in .handle_sstables_pending_delete()
sstable_directory: Use yielding lister in .cleanup_column_family_temp_sst_dirs()
sstable_directory: Use yielding lister in .prepare()
sstable_directory: Shorten lister loop
sstable_directory: Use with_closeable() in .process()
directory_lister: Add noexcept default move-constructor
In sstable directory test there are two of those -- one that works on
path, state, env and callback, and the other one that just needs env and
callback, getting path from env and assuming state is normal.
Two test cases in this test can enjoy the shorter one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20395
On very large node, LimitNOFILES=80000 may not enough size, it can cause
"Too many files" error.
To avoid that, let's increase LimitNOFILES on scylla_setup stage,
generate optimal value calurated from memory size and number of cpus.
Closesscylladb/scylla-enterprise#4304Closesscylladb/scylladb#20443
c70f321c6f added an extra check if KS
exists. This check can throw `data_dictionary::no_such_keyspace`
exception, which is supposed to be caught and a more user-friendly
exception should be thrown instead.
This commit fixes the above problem and adds a testcase to validate it
doesn't appear ever again.
Also, I moved the check for the keyspace outside of the `for` loop, as
it doesn't need to be checked repeatedly.
Fixes: scylladb/scylladb#20097Closesscylladb/scylladb#20404
The case of a GSI with two key attributes (hash and range) which were both
not keys in the base table is a special case, not supported by CQL but
allowed in Alternator. We have several tests for this case, but they don't
cover all the strange possibilities that a GSI row disappears / reappears
when one or two of the attributes is updated / inserted / deleted.
So this patch includes a more extensive test for this case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a test that types which are not allowed for GSI keys -
basically any type except S(tring), B(ytes) or N(number), are rejected
as expected - an error path that we didn't cover in existing tests.
The new test passes - Alternator doesn't have a bug in this area, and
as usual, also passes on DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
To allow adding a GSI to an existing table (refs #11567), we plan to
re-implement GSIs to stop forcing their key attribute to become a real
column in the schema - and let it remains a member of the map ":attrs"
like all non-key attributes. But since LSIs can only be defined on table
creation time, we don't have to change the LSI implementation, and these
can still force their key to become a real column.
What the test in this patch does is to verify that using the same
attribute as a key of *both* GSI and LSI on the same table works.
There's a high risk that it won't work: After all, the LSI should force the
attribute to become a real column (to which base reads and writes go), but
the GSI will use a computed column which reads from ":attrs", no? Well,
it turns out that view.cc's value_getter::operator() always had a
surprising exception which "rescues" this test and makes it pass: Before
using a computed column, this code checks if a base-table column with the
same name exists, and if it does, it is used instead of the computed column!
It's not clear why this logic was chosen, but it turns out to be really
useful for making the test in this test pass. And it's important that if
we ever change that unintuitive behavior, we will have this test as a
regression test.
The new test unsurprisingly passes on current Scylla because its
implementation of GSI and LSI is still the same. But it's an important
regression test for when we change the GSI implementation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Expand another Alternator test (test_gsi.py::test_gsi_missing_attribute)
to write items not just using PutItem, but also using UpdateItem and
BatchWriteItem. There is a risk that these different operations use
slightly different code paths - so better check all of them and not
just PutItem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All of the tests in test/alternator/test_gsi.py use strings as the GSI's
keys. This tests a lot of GSI functionality, but we implicitly assumed that
our implementation used an already-correct and already-tested implementation
of key columns and MV, which if it works for one type, works for other types
as well.
This assumption will no longer hold if we reimplement GSI on a "computed
column" implementation, which might run different code for different types
of GSI key attributes (the supported types are "S"tring, "B"ytes, and
"N"umber).
So in this patch we add tests for writing and reading different types of
GSI key attributes. These tests showed their importance as regression
tests when the first draft of the GSI reimplementation series failed them.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Alternator uses a common function get_typed_value() to read the values
of key attribute and confirm they have the expected type (key attributes
have a fixed type in the schema). If the type is wrong, we want to print
a "Type mismatch" error message.
But the current implementation did the checks in the wrong order, and
as a result could print a "Malformed value object" message instead of a
"Type mismatch". That could happen if the wrong type is a boolean, map,
list, or basically any type whose JSON representation is not a string.
The allowed key types - bytes), string and number - all have string
representations in JSON, but still we should first report the mismatched
type and only report the "Malformed object" if the type matches but the
JSON is faulty.
In addition to fixing the error message, we fix an existing test which
complained in a comment (but ignored) that the error message in some
case (when trying to use a map where a key is expected) the strange
"Malformed value object" instead of the expected "Type mismatch".
The next patch will add an additional reproducer for this problem and
its fix. That test will do:
```
with pytest.raises(ClientError, match='ValidationException.*mismatch'):
test_table_gsi_6.put_item(Item={'p': p, 's': True})
```
I.e., it tries to set a boolean value for a string key column, and
expect to get the "Type mismatch" error and not the ugly "Malformed
value object".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Most tests in test_gsi.py involve simple updates to a GSI, just
creating a GSI row. Although a couple of tests did involve more
complex operations (such as an update requiring deleting an old row
from the GSI and inserting a new one,), we did not have a single
organized test designed to check all these cases, so we add one in
this patch.
This test (test_update_gsi_pk) will be important for verifying
the low-level implementation of the new GSI implementation that
we plan to based on computed columns. Early versions of that code
passed many of the simpler tests, but not this one.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We soon plan to refactor Alternator's GSI and change the validation of
values set in attributes which are GSI keys. It's important to test that
when updating attributes that are *not* GSI keys - and are either base-
table keys or normal non-key attributes - the validation didn't change.
For example, empty strings are still not allowed in base-table key
attributes, but are allowed (since May 2020 in DynamoDB) in non-key
attributes.
We did have tests in this area, but this patch strengthens them -
adding a test for non-key attribute, and expanding the key-attribute
test to cover the UpdateItem and BatchWriteItem operations, not just
PutItem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
To enhance the test reports UX:
1. switching off/on passed/failed/skipped test for better visibility
2. better searching in test results
3. understanding the trends of execution for each test
4. better configurability of the final report
Enable allure adapter for all python tests.
Add tags and parameters to the test to be able to distinguish them across modes and runs.
Related: https://github.com/scylladb/qa-tasks/issues/1665
Related: https://github.com/scylladb/scylladb/pull/19335
Related: https://github.com/scylladb/scylladb/pull/18169Closesscylladb/scylladb#19942
* github.com:scylladb/scylladb:
[test.py] Clean duplicated arg for test suite
[test.py] Enable allure for python test
Two tests had a typo 'item' instead of 'Item'. If Scylla had a bug, this
could have caused these tests to miss the bug.
Scylla passes also the fixed test, because Scylla's behavior is correct.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When an attribute is a GSI key, DynamoDB imposes certain rules when
writing values for it - it must be of the declared type for that key,
and can't be an empty string. We had tests for this, but all of them
did the write using the PutItem operation.
In this patch we also test the same things using the UpdateItem and
BatchWriteItem operations. Because Scylla has different code paths
for these three operations, and each code path needs to remember to
call the validation function, all three should all be checked and not just
PutItem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Scylla's materialized views naturally skip any base rows where the view's
key isn't set (is NULL), because we can't create a view row with a null
key. To make the user aware that this is happening, the user is required
to add "WHERE ... IS NOT NULL" for the view's key columns when defining
the view. However, the only place that these extra IS NOT NULL clauses
are checked are in the CQL "CREATE MATERIALIZED VIEWS" statement - they
are completely ignored in all other places in the code.
In particular, when we create a materialized view in Alternator (GSI or
LSI), we don't have to add these "IS NOT NULL" clauses, as they are
outright ignored. We didn't know they were ignored, and made an effort
to add them - but no matter how incorrectly we did it, it didn't matter :-)
In commit 2bf2ffd3ed it turned out we had a
typo that caused the wrong column name to be printed. Also, even today we
are still missing base key columns that aren't listed as a view key in
Alternator but still added as view clustering keys in Scylla - and again
the fact these were missing also didn't matter. So I think it's time to
stop pretending, and stop calculating these "IS NOT NULL" strings, so
this patch outright removes them from the Alternator view-creation code.
Beyond being a nice cleanup of unnecessary and inaccurate code, it
will also be necessary when we allow in later patches to index for
an Alternator attribute "x" not a real column x in the base table but
rather an element in the ":attrs" map - so adding a "x IS NOT NULL" isn't
only unnecessary, it is outright illegal: The expression evaluation code,
even though it doesn't do anything with the "IS NOT NULL" expression,
still verifies that "x" is a valid column, which it isn't.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have tests for the feature of adding or removing a GSI from
an existing table, which Alternator doesn't yet support (issue #11567).
In this patch we add another check, how after a GSI is added, you can
no longer add items with the wrong type for the indexed type, and after
removing a GSI, you can. The expanded tests pass on DynamoDB, and
obviously still xfail on Alternator because the feature is not yet
implemented.
Refs #11567.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Most of the Alternator tests are careful to unconditionally remove the test
tables, even if the test fails. This is important when testing on a shared
database (e.g., DynamoDB) but also useful to make clean shutdown faster
as there should be no user table to flush.
We missed a few such cases in test_gsi.py, and fixed some of them in
commit 59c1498338 but still missed a few,
and this patch fixes some more instances of this problem.
We do this by using the context manager new_test_table() - which
automatically deletes the table when done - instead of the function
create_test_table() which needs an explicit delete at the end.
There are no functional changes in this patch - most of the lines
changed are just reindents.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
in main.cc, we start redis with `ss.local().register_protocol_server()`
only if it is enabled. but `storage_service` always calls
`stop_server()` with _all_ registered server, no matter if they have
started or not. in general, it does not hurt. for instance,
`redis::controller::stop_server()` is a noop, if the controller
is not started. but `storage_service` still print the logging message
like:
```
INFO 2024-09-04 11:20:02,224 [shard 0:main] storage_service - Shutting down redis server
INFO 2024-09-04 11:20:02,224 [shard 0:main] storage_service - Shutting down redis server was successful
```
this could be confusing or at least distracting when a field engineer
looks at the log. also, please note, `redis_port` and `redis_ssl_port`
cannot be changed dynamically once scylla server is up, so we do not
need to worry about "what if the redis server is started at runtime,
how can is be stopped?".
the same applies to alternator service.
in this change, to avoid surprises, we conditionally register the
protocol servers with the storage service based on their enabled statuses.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20472
after switching over to the new `seastar::format()` which enables
the compile-time format check, the fmt string should be a constexpr,
otherwise `fmt::format()` is not able to perform the check at compile
time.
to prepare for bumping up the seastar module to a version which
contains the change of `seastar::format()`, let's mark the format
string with `constexpr const`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20484
Adds the test_hints_consistency_during_decommission test which
reproduces the failure observed in scylladb/scylla-dtest#4582. It uses
error injections, including the newly added
topology_coordinator_pause_after_streaming injection, to reliably
orchestrate the scenario observed there.
In a nutshell, the test makes sure to replay hints after streaming
during decommission has finished, but before the cluster switches to
reading from new replicas. Without the fix, hints would be replayed to
the decommissioned node and then would be lost forever after the cluster
start reading from new replicas.
Move create_sync_point and await_sync_point from the scope of the
test_sync_point test to the file scope. They will be used in a test that
will be introduced in the commit that follows.
Currently, when attempting to send a hint, we might choose its
recipients in one of two ways:
- If the original destination is a natural endpoint of the hint, we only
send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.
There is a problem when we decommission a node: while data is streamed
away from that node, it is still considered to be a natural endpoint of
the data that it used to own. Because of that, it might happen that a
hint is sent directly to it but streaming will miss it, effectively
resulting in the hint being discarded.
As sending the hint _only_ to the leaving replica is a rather bad idea,
send the hint to all replicas also in the case when the original
destiantion of the hint is leaving.
Note that this is a conservative fix written only with the decommission
+ vnode-based keyspaces combo in mind. In general, such "data loss" can
occur in other situations where the replica set is changing and we go
through a streaming phase, i.e. other topology operations in case of
vnodes and tablet load balancing. However, the consistency guarantees of
hinted handoff in the face of topology changes are not defined and it is
not clear what they should be, if there should be any at all. The
picture is further complicated by the fact that hints are used by
materialized views, and sending view updates to more replicas than
necessary can introduce inconsistencies in the form of "ghost rows".
This fix was developed in response to a failing test which checked the
hint replay + decommission scenario, and it makes it work again.
Fixesscylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835
It's a small method and it is only used once in send_one_mutation.
Inlining it lets us get rid of its declaration in the header - now, if
one needs to change the variables passed from one function to another,
it is no longer necessary to change the header.
Shorter and simpler this way. Hopefully it doesn't sit on critical paths
Closesscylladb/scylladb#20460
* github.com:scylladb/scylladb:
sstables: Fix indentation after previous patch
sstables: Coroutinize sstable::read_summary()
The idea of the test is to have a cluster where one node is stressed with injections and failures and the rest of the cluster is used to make progress of the raft state machine.
To achieve this following two lists introduced in the PR:
- ERROR_INJECTIONS in error_injections.py
- CLUSTER_EVENTS in cluster_events.py
Each cluster event is an async generator which has 2 yields and should be used in the following way:
0. Start the generator:
```python
>>> cluster_event_steps = cluster_event(manager, random_tables, error_injection)
```
1. Run the prepare part (before the first yield)
```python
>>> await anext(cluster_event_steps)
```
2. Run the cluster event itself (between the yields)
```python
>>> await anext(cluster_event_steps)
```
3. Run the check part (after the second yield)
```python
>>> await anext(cluster_event, None)
```
Closesscylladb/scylladb#16223
* github.com:scylladb/scylladb:
test: randomized failure injection for Raft-based topology
test: error injections for Raft-based topology
[test.py] topology.util: add get_non_coordinator_host() function
[test.py] random_tables: add UDT methods
[test.py] random_tables: add CDC methods
[test.py] api: get scylla process status
[test.py] api: add expected_server_up_state argument to server_add()
The `service_level::marked_for_deletion` field is always set to `false`. It might have served some purpose in the past, but now it can be just removed, simplifying the code and eliminating confusion about the field.
This is just code cleanup, no backport is needed.
Closesscylladb/scylladb#20452
* github.com:scylladb/scylladb:
service/qos: remove the marked_for_deletion parameter
service/qos: add constructors to service_level
The test cases in this file use an error injection to reduce raft group
0 timeouts (from the default 1 minute), in order to speed up the tests;
the scenarios expect these timeouts to happen, so we want them to happen
as quick as possible, but we don't want to reduce timeouts so much that
it will make other operations fail when we don't expect them to (e.g.
when the test wants to add a node to the cluster).
Unfortunately the selected 5 seconds in debug mode was not enough and
made the tests flaky: scylladb/scylladb#20111.
Increase it to 10 seconds. This unfortunately will slow down these tests
as they have to sometimes wait for 10 seconds for the timeout to happen.
But better to have this than a flaky test.
Fixes: scylladb/scylladb#20111Closesscylladb/scylladb#20320
During review of 0857b63259 it was noticed that the function
repair_get_row_diff_with_rpc_stream_process_op()
and its _slow_path callee only ever return stop_iteration::no (or throw
an exception). As such, its return value is useless, and in fact the
only caller ignores it. Simplify by returning a plain future<>.
Closesscylladb/scylladb#20441
so that it is accessible from its caller. if we enforce the
compile-time format string check, the formatter would need the access to
the specialization of `fmt::formatter` of the arguments being foramtted.
to be prepared for this change, let's move the `fmt::formatter`
specialization up, otherwise we'd have following error after switching
to the compile-time format string check introduced by a recent seastar
change:
```
In file included from ./auth/authenticator.hh:22: ./auth/authentication_options.hh:50:49: error: call to consteval function 'fmt::basic_format_string<char, auth::authentication_option &>::basic_format_string<
char[32], 0>' is not a constant expression
50 | : std::invalid_argument(fmt::format("The {} option is not supported.", k)) {
| ^ ./auth/authentication_options.hh:57:13: error: explicit specialization of 'fmt::formatter<auth::authentication_option>' after instantiation
57 | struct fmt::formatter<auth::authentication_option> : fmt::formatter<string_view> {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/fmt/base.h:1228:17: note: implicit instantiation first required here
1228 | -> decltype(typename Context::template formatter_type<T>().format(
| ^
In file included from replica/distributed_loader.cc:30:
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20447
And move the comment inside if while at it, it looks better in there
(and makes less churn in the patch itself)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The idea of the test is to have a small cluster, where one node
is stressed with injections and failures and the rest of
the cluster is used to make progress of the Raft state machine.
To achieve this following two lists introduced in the commit:
- ERROR_INJECTIONS in error_injections.py
- CLUSTER_EVENTS in cluster_events.py
Each cluster event is an async generator which has 2 yields and should be
used in the following way:
0. Start the generator:
>>> cluster_event_steps = cluster_event(manager, random_tables, error_injection)
1. Run the prepare part (before the first yield)
>>> await anext(cluster_event_steps)
2. Run the cluster event itself (between the yields)
>>> await anext(cluster_event_steps)
3. Run the check part (after the second yield)
>>> await anext(cluster_event, None)
Add get_non_coordinator_host() function which returns
ServerInfo for the first host which is not a coordinator
or None if there is no such host.
Also rework get_coordinator_host() to not fail if some
of the hosts don't have a host id.
Allow to return from server_add() when a server reaches specified state.
One of:
- PROCESS_STARTED
- HOST_ID_QUERIED (previously called NOT_CONNECTED)
- CQL_CONNECTED (renamed from CONNECTED)
- CQL_QUERIED (was just QUERIED)
Also, rename CqlUpState to ServerUpState and move to internal_types.
After it was switched to use schema builder, the indenation of untouched
lines deserves one extra space.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Everything, but perf test is straightforward switch.
The perf-test generated regular columns dynamically via vector, with
builder the vector goes away.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When creating sstables this test allocates temporary local options.
That works, because this test doesn't run on object storage, but it's
more correct to pick storage options from the table at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20440
Add a test replacing a node and verifying the contents of the
view_build_status table are updated as expected, having rows for the new
node and no rows for the old node.
Change the util function wait_for_view to read the view build status
from the system.view_build_status_v2 table which replaces
system_distributed.view_build_status.
The old table can still be used but it is less efficient because it's
implemented as a virtual table which reads from the v2 table, so it's
better to read directly from the v2 table. This can cause slowness in
tests.
The additional util function wait_for_view_v1 reads from the old table.
This may be needed in upgrade tests if the v2 table is not available
yet.
When writing to the view_build_status we have common logic related to
upgrade and deciding whether to write to sys_dist ks or group0.
Move this common logic to a generic function used by all functions
writing to the table.
Add an intermediate phase to the view builder migration to v2 where we
write to both the old and new table in order to not lose writes during
the migration.
We add an additional view builder version v1_5 between v1 and v2 where
we write to both tables. We perform a barrier before moving to v2 to
ensure all the operations to the old table are completed.
When a node is removed we want to clean its rows from the
view_build_status table.
Now when removing a node and generating the topology state update, we
generate also the mutations to delete all the possible rows belonging to
the node from the table.
When migrating the view_build_status to v2, skip adding any leftover
rows that don't correspond to an existing node or an existing view.
Previously such rows could have been created and not cleaned, for
example when a node is removed.
After migrating the view build status from
system_distributed.view_build_status to system.view_build_status_v2, we
set system_distributed.view_build_status to be a virtual table, such
that reading from it is actually reading from the underlying new table.
The reason for this is that we want to keep compatibility with the old
table, since it exists also in Cassandra and it is used by various external
tools to check the view build status. Making the table virtual makes the
transition transparent for external users.
The two tables are in different keyspaces and have different shard
mapping. The v1 table is a distributed table with a normal shard
mapping, and the v2 table is a local table using the null sharder. The
virtual reader works by constructing a multishard reader which reads the rows
from shard zero, and then filtering it to get only the rows owned by the
current shard.
Add tests to verify the new view_build_status_v2 is used by the
view_builder and can be read from all nodes with the expected values.
Also test a migration from the v1 layout to v2.
Migrate view_builder to v2, to store the view build status of all nodes
in the group0 based table view_build_status_v2.
Introduce a feature view_build_status_on_group0 so we know when all
nodes are ready to migrate and use the new table.
A new cluster is initialized to use v2. Otherwise, The topology coordinator
initiates the migration when the feature is enabled, if it was not done
already.
The migration reads all the rows in the v1 table and writes it via
group0 to the v2 table, together with a mutation that updates the
view_builder parameter in scylla_local to v2. When this mutation is
applied, it updates the view_builder service to start using the v2
table.
Add a new scylla_local parameter view_builder_version, and functions to
read and mutate the value.
The version value defaults to v1 if it doesn't exist in the table.
Update the view_status function to read from the new
view_build_status_v2 table when enabled.
The code to read and extract the values is identical to v1 and v2 except it
accesses different keyspace and table, so the common code is extracted
to the view_status_common function and used by both v1 and v2 flows with
appropriate parameters.
Introduce the announce_with_raft function as alternative to writing view build
status mutations to the table in system_distributed. Instead, we can
apply the mutations via group0 operation to the view_build_status_v2
table.
All the view_builder functions that write to the view_build_status table
can be configured by a flag to either write the legacy way or via raft.
Store references of group0_client and query_processor in the
view_builder service.
They are required for generating mutations and writing them via group0.
Because of https://github.com/scylladb/scylladb/issues/9285 heat weighted
load balancer may sometimes return same node twice. It may cause wrong
data to be read or unexpected errors to be returned to a client. Since
the original bug is not easy to fix and it is rare lets introduce a
workaround. We will check for duplicates and will use non HWLB one if
one is found.
Fixesscylladb/scylladb#20430Closesscylladb/scylladb#20414
When testing mv admission control, we perform a large view update
and check if the following view update can be admitted due to the
high view backlog usage. We rely on a delay which keeps the backlog
high for longer to make sure the backlog is still increased during
the second write. However, in some test runs the delay is not long
enough, causing the second write to miss the large backlog and not
hit admission control.
In this patch we keep the increased backlog high using another
injection instead of relying on a delay to make absolute sure
that the backlog is still high during the second write.
Fixesscylladb/scylladb#20382Closesscylladb/scylladb#20445
Added a new parameter `consider_only_existing_data` to major compaction
API endpoints. When enabled, major compaction will:
- Force-flush all tables.
- Force a new active segment in the commit log.
- Compact all existing SSTables and garbage-collect tombstones by only
checking the SSTables being compacted. Memtables, commit logs, and
other SSTables not part of the compaction will not be checked, as they
will only contain newer data that arrived after the compaction
started.
The `consider_only_existing_data` is passed down to the compaction
descriptor's `gc_check_only_compacting_sstables` option to ensure that
only the existing data is considered for garbage collection.
The option is also passed to the `maybe_flush_commitlog` method to make
sure all the tables are flushed and a new active segment is created in
the commit log.
Fixes#19728
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
When gc_check_only_compacting_sstables is enabled,
get_max_purgeable_timestamp should not check memtables and other
sstables that are not part of the compaction to deduce the max purgeable
timestamp.
Refs #19728
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
When the compaction_descriptor's gc_check_only_compacting_sstables flag
is enabled, create and pass a copy of the get_tombstone_gc_state that
will skip checking the commitlog.
Refs #19728
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added a new method, `with_commitlog_check_disabled`, that returns a new
copy of the tombstone_gc_state but with commitlog check disabled. This
will be used by a following patch to disable commitlog checks during
compaction.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added new option, `gc_check_only_compacting_sstables`, to
compaction_descriptor to control the garbage collection behavior. The
subsequent patches will use this flag to decide if the garbage
collection has to check only the SSTables being compacted to collect
tombstones. This option is disabled for now and will be enabled based on
a new compaction parameter that will be added later in this patch
series.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Major compaction flushes all tables as a part of flushing the commitlog.
After forcing new active segments in the commitlog, all the tables are
flushed to enable reclaim of older commitlog segments. The main goal is
to flush the commitlog and flushing all the table is just a dependency.
Rename maybe_flush_all_tables to maybe_flush_commitlog so that it
reflects the actual intent of the major compaction code. Added a new
wrapper method to database::flush_all_tables(),
database::flush_commitlog(), that is now called from
maybe_flush_commitlog.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Add a new parameter, `force_flush` to the maybe_flush_all_tables()
method. Setting `force_flush` to true will flush all the tables
regardless of when they were flushed last. This will be used by the new
compaction option in a following patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Before the introduction of "scripts/refresh-submodules.sh", there was
indeed some manual work for the maintainer to do, hence "publish your
work" must have sounded correct. Today, the phrase "publish your work"
sounds confusing.
Commit 71da4e6e79 ("docs: Document sync-submodules.sh script in
maintainer.md", 2020-06-18) should have arguably reworded the last step of
the submodule refresh procedure; let's do it now.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Closesscylladb/scylladb#20333
The with_sstable_dir() helper no longer needs one, it used to pass it as
argument to sstable_directory constructor, but now the directory doesn't
need it (takes semaphore via table object).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20396
Squash call to lister.get() and check for the returned value into
while()'s condition. This saves few more lines of code as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method already uses yielding lister, but handles the exceptions
explicitly. Use with_closeable() helper, it makes the code shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's required to make it possible to push lister into with_closeable().
Its requiremenent of nothrow-move-constructible doesn't accept
default-generated one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The `skip()` method of the compressed data source implementation uses an
assert statement to check if the given offset is valid.
Replace this with `on_internal_error()` to fail gracefully. An invalid
offset shouldn't bring the whole server down.
Also, enhance the error message for unsynced compressed readers.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
It used to rely on bool (wrapped with pointer) and future<>-based loop
helper, now it can just break from the while loop.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And update its callers again.
Preserve no longer relevant local smart pointers until next patch.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One accepts integer generations, another one accepts "generic" ones. The
latter is only called by the former, so no sense in keeping it around.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add a default constructor and a constructor which explicitly
initializes all fields of the service_level structure.
This is done in order to make sure that removal of the
marked_for_deletion field can be done safely - otherwise, for example,
service_level could be aggregate-initialized with an incomplete list of
values for the fields, and removing marked_for_deletion which is in the
middle of the struct would cause the is_static field to be initialized
with the value that was designated for marked_for_deletion.
As a bonus, make sure that marked_for_deletion and is_static bool fields
are initialized in the default constructor to false in order to avoid
potential undefined behavior.
There are two versions of `raft_group0_client::hold_read_apply_mutex`, one takes `abort_source&`, the other doesn't. Modify all call sites that used the non-abort-source version to pass an `abort_source&`, allowing us to remove the other overload.
If there is no explicit reason not to pass an `abort_source&`, then one should be passed by default -- it often prevents hangs during shutdown.
---
No backport needed -- no known issues affected by this change.
Closesscylladb/scylladb#19996
* github.com:scylladb/scylladb:
raft_group0_client: remove `hold_read_apply_mutex` overload without `abort_source&`
storage_service: pass `_abort_source` to `hold_read_apply_mutex`
group0_state_machine: pass `_abort_source` to `hold_read_apply_mutex`
api: move `reload_raft_topology_state` implementation inside `storage_service`
for following reasons:
1. the ppa in question does not provide the build for the latest ubuntu's LTS release. it only builds for trusty, xenial, bionic and jammy. according to https://wiki.ubuntu.com/Releases, the latest LTS release is ubuntu noble at the time of writing.
2. the ppa in question does not provide the packages used in production. it does provides the package for *building* scylla
3. after we introduced the relocatable package, there is no need to provide extra user space dependencies apart from scylla packages.
so, in this change, we remove all references to enabling the Scylla/PPA repository.
Fixesscylladb/scylladb#20449
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20450
for better readability.
presumably, `sstable::seal_sstable()` is not on the critical path,
and we don't need to worry about the overhead of using C++20 coroutine.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20410
instead of chaining the conditions with '&&', break them down.
for two reasons:
* for better readability: to group the conditions with the same
purpose together
* so we don't look up the table twice. it's an anti-pattern of
using STL, and it could be confusing at first glance.
this change is a cleanup, so it does not change the behavior.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20369
The Alternator TTL scanning code uses an object "scan_ranges_context"
to hold the scanning context. One of the members of this object is
a service::query_state, and that in turn holds a reference to a
service::client_state. The existing constructor created a temporary
client_state object and saved a reference to it - which can result
in use after free as the temporary object is freed as soon as the
constructor ends.
The fix is to save a client_state in the scan_ranges_context object,
instead of a temporary object.
Fixes#19988
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20418
Makes some commitlog options runtime updatable. Most important for this case,
the usage of fragmented entries. Also adds a subscription in database on said
feature, to possibly enable once cluster enables it.
Hides the functionality behind a cluster feature, i.e. postspones
using it until an upgrade is complete etc. This to allow rolling back
even with dirty nodes, at least until a cluster is commited.
Feature can also be disabled by scylla option, just in case. This will
lock it out of whole cluster, but this is probably good, because depending
on off or on, certain schema/raft ops might fail or succeed (due to large
mutations), and this should probably be equivalent across nodes.
Refs #18161
Yet another approach to dealing with large commitlog submissions.
We handle oversize single mutation by adding yet another entry
type: fragmented. In this case we only add a fragment (aha) of
the data that needs storing into each entry, along with metadata
to correlate and reconstruct the full entry on replay.
Because these fragmented entries are spread over N segments, we
also need to add references from the first segment in a chain
to the subsequent ones. These are released once we clear the
relevant cf_id count in the base.
*
This approach has the downside that due to how serialization etc
works w.r.t. mutations, we need to create an intermediate buffer
to hold the full serialized target entry. This is then incrementally
written into entries of < max_mutation_size, successively requesting
more segments.
On replay, when encountering a fragment chain, the fragment is
added to a "state", i.e. a mapping of currently processing
frag chains. Once we've found all fragments and concatenated
the buffers into a single fragmented one, we can issue a
replay callback as usual.
Note that a replay caller will need to create and provide such
a state object. Old signature replay function remains for tests
and such.
This approach bumps the file format (docs to come).
To ensure "atomicity" we both force syncronization, and should
the whole op fail, we restore segment state (rewinding), thus
discarding data all we wrote.
v2:
* Improve some bookeep, ensure we keep track of segments and flush
properly, to get counter correct
This commit temporarily disables redirections for all pages under Features
that were moved with this PR: https://github.com/scylladb/scylladb/pull/20401
Redirections work for all versions. This means that pages in 6.1 are redirected
to URLs that are not available yet (because 6.2 has not been released yet).
The redirections are correct and should be enabled when 6.2 is released:
I've created an issue to do it: https://github.com/scylladb/scylladb/issues/20428Closesscylladb/scylladb#20429
Tests that try to access sstables from test/resource/ typically sstable::load() it after object creation. There's reusable_sst() helper for that. This PR fixes one more caller that still goes longer route by doing sstable and loading it on its own.
Closesscylladb/scylladb#20420
* github.com:scylladb/scylladb:
test: Call reusable sst from ka_sst() helper
test: Move sstable_open_config to reusable_sst()'s argument
There's no point waiting for this lock if `storage_service` is being
aborted. In theory the lock, if held, should be eventually released by
whatever is holding it during shutdown -- but if there is some cyclic
reference between the services, and e.g. whatever holds the lock is
stuck because of ongoing shutdown and would only be unstuck by
`storage_service` getting stopped (which it can't because it's waiting
on the lock), that would cause a shutdown deadlock. Better to be safe
than sorry.
In later commit we'll want to access more `storage_service` internals
in the API's implementation (namely, `_abort_source`)
Also moving the implementation there allows making
`service::topology_transition()` private again (it was made public in
992f1327d3 only for this API
implementation)
Run the reversed queries on a 2-node cluster with CL=ALL with and
without NATIVE_REVERSE_QUERIES feature flag. When the flag is enabled,
the native reversed format is used, otherwise the legacy format.
The NATIVE_REVERSE_QUERIES feature flag is suppressed with an error
injection that simulates cluster upgrade process.
Backport is not required. The patch adds additional upgrade tests
for https://github.com/scylladb/scylladb/pull/18864Closesscylladb/scylladb#20179
This patch address two requests made by reviewers of the original "Add
CQL-based RBAC support to Alternator" series. Both requests were about
the error messages produced when access is denied:
1. The error message is improved to use more proper English, and also
to include the name of the role which was denied access.
2. The permission-check and error-message-formatting code is
de-duplicated, using a common function verify_permission().
This de-duplication required moving the access-denied error path to
throwing an exception instead of the previous exception-free
implementation. However, it can be argued that this change is actually
a good thing, because it makes the successful case, when access is
allowed, faster.
The de-duplicated code is shorter and simpler, and allowed changing
the text of the error message in just one place.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20326
in 372a4d1b79, we introduced a change
which was for debugging the logging message. but the logging message
intended for printing the temp_dir not prints an `optional<int>`. this
is both confusing, and more importantly, it hurts the debuggability.
in this change, the related change is reverted.
Fixesscylladb/scylladb#20408
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20409
The sstable_mutation_test wants to load pre-existing sstables from
resouce/ subdir. For that there's reusable_sst() helper on env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
What we have today in "docs/dev/docker-hub.md" on "aio-max-nr" dates back
to scylla commit f4412029f4 ("docs/docker-hub.md: add quickstart section
with --smp 1", 2020-09-22). Problems with the current language:
- The "65K" claim as default value on non-production systems is wrong;
"fs/aio.c" in Linux initializes "aio_max_nr" to 0x10000, which is 64K.
- The section in question uses equal signs (=) incorrectly. The intent was
probably to say "which means the same as", but that's not what equality
means.
- In the same section, the relational operator "<" is bogus. The available
AIO count must be at least as high (>=) as the requested AIO count.
- Clearer names should be used;
adjust_max_networking_aio_io_control_blocks() in "src/core/reactor.cc"
sets a great example:
- "reactor::max_aio" should be called "storage_iocbs",
- "detect_aio_poll" should be called "preempt_iocbs",
- "reactor_backend_aio::max_polls" should be called "network_iocbs".
- The specific value 10000 for the last one ("network_iocbs") is not
correct in scylla's context. It is correct as the Seastar default, but
scylla has used 50000 since commit 2cfc517874 ("main, test: adjust
number of networking iocbs", 2021-07-18).
Rewrite the section to address these problems.
See also:
- https://github.com/scylladb/scylladb/issues/5981
- https://github.com/scylladb/seastar/pull/2396
- https://github.com/scylladb/scylladb/pull/19921
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
This commit one of the series to remove the FAQ page by removing irrelevant/outdated entries
or moving them to the forum.
The question about seeds is irrelevant, not frequently asked, and covered in other sections
of the docs. Also, it mentions versions that are no longer supported.
Closesscylladb/scylladb#20403
Even after 13caac7, we still have more files incorrect permission, since
we use "cp -r" and creating new file with redirect.
To fix this, we need to replace "cp -r" with "cp -pr", and "chmod <perm>" on
newly created files.
Fixes#14383
Related #19775Closesscylladb/scylladb#19786
This commit moves the Features page from the section for developers
to the top level in the page tree. This involves:
- Moving the source files to the *features* folder from the *using-scylla* folder.
- Moving images into *features/images* folder.
- Updating references to the moved resources.
- Adding redirections to the moved pages.
Closesscylladb/scylladb#20401
this change contains two improvements to "backup" and "restore" commands:
- let them print task id
- let them return 1 as the exist status code upon operation failure
----
these changes are improvements to the newly introduced commands, which are not in any LTS branches yet, so no need to backport.
Closesscylladb/scylladb#20371
* github.com:scylladb/scylladb:
tools/scylla-nodetool: return failure with exit code in backup/restore
tools/scylla-nodetool: let backup/restore print task id
before this change, "backup" and "restore" commands always return 0 as
their exist code no matter if the performed operation fails or not.
inspired by the "task" commands of nodetool, let's return 1 with
exit code if the operation fails.
the tests are updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in 20fffcdc, we added the "task wait" subcommand, so user is allowed to
interact with a task with its task id. and in existing implementation of
"backup" and "restore" command, if user does not pass `--nowait`, the
command just exits without any output upon sending the request to
scylladb.
in this change, we print out the task_id if user does not pass
`--nowait` command line option to "backup" or "restore" command. this
allows user to follow up on the operation if necessary.
the tests are updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This patch adds functional testing for the role-based access control
(RBAC) "auto-grant" feature, where if a user that is allowed to create
a table, it also recieves full permissions over the table it just
created. We also test permissions over new materialized views created
by a user, and over CDC logs. The test for CDC logs reproduces an
already suspected bug, #19798: A user may be allowed to create a table
with CDC enabled, but then is not allowed to read the CDC log just
created. The tests show that the other cases (base tables and views)
do not have this bug, and the creating user does get appropriate
permissions over the new table and views.
In addition to testing auto-grant, the patch also includes tests for
the opposite feature, "auto-revoke" - that permissions are removed when
the table/view/cdc is deleted. If we forget to do that while implementing
auto-grant, we risk that users may be able to use tables created by
other users just because they used the same table _name_ earlier.
It's important to have these auto-revoke tests together with the
auto-grant tests that reproduce #19798 - so we don't forget this
part when finally fixing #19798.
Refs #19798.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19845
Bind variables in CQL have two formats: positional (`?`) where a variable is referred to by its relative position in the statement, and named (`:var`), where the user is expected to supply a name->value mapping.
In 19a6e69001 we identified the case where a named bind variable appears twice in a query, and collapsed it to a single entry in the statement metadata. Without this, a driver using the named variable syntax cannot disambiguate which variable is referred to.
However, it turns out that users can use the positional call form even with the named variable syntax, by using the positional API of the driver. To support this use case, we add a configuration variable to disable the same-variable detection.
Because the detection has to happen when the entire statement is visible, we have to supply the configuration to the parser. We call it the `dialect` and pass it from all callers. The alternative would be to add a pre-prepare call similar to fill_prepare_context that rewrites all expressions in a statement to deduplicate variables.
A unit test is added.
Fixes#15559
This may be useful to users transitioning from Cassandra, so merits a backport.
Closesscylladb/scylladb#19493
* github.com:scylladb/scylladb:
cql3: add option to not unify bind variables with the same name
cql3: introduce dialect infrastructure
cql3: prepared_statement_cache: drop cache key default constructor
* in the "Backporting Seastar commits" section, there's a single quote
instead of a backtick in this line, so fix it.
* add backticks around `refresh-submodules.sh`, which is a filename.
* correct the command line setting a git config option, because `git-config`
does not support this command line syntax,
```console
$ git config --global diff.conflictstyle = diff3
$ git config --global get diff.conflictstyle
=
$ git config --global diff.conflictstyle diff3
$ git config --global get diff.conflictstyle
diff3
```
quote from git-config(1)
> ```
> git config set [<file-option>] [--type=<type>] [--all] [--value=<value>] [--fixed-value] <name> <value>
> ```
* stop using the deprecated mode of the `git-config` command, and use
subcommand instead. as git-config(1) puts:
> git config <name> <value> [<value-pattern>]
> Replaced by git config set [--value=<pattern>] <name> <value>.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20328
Check if podman is available before docker. If it is, use it. Otherwise, check for docker.
1. Podman is better. It runs with fewer resources, and I've had display issues with Docker (output was not shown consistently)
2. 'which docker' works even when the docker service and socket are turned off.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#20342
Triggers the "Build Docs" PR workflow whenever the `db/config.cc` or `db/config.h` files are edited. These files are used to produce documentation, and this change will help prevent the introduction of breaking changes to the documentation build when they are modified.
Closesscylladb/scylladb#20347
for testing the load performance of load_and_stream operation.
Refs #19989
---
no need to backport. it adds two new tests to the existing `perf_sstable` tool for evaluating the load performance when performing the "load_and_streaming" operation. hence has no impact on the production.
Closesscylladb/scylladb#20186
* github.com:scylladb/scylladb:
perf/perf_sstable: add {crawling,partitioned}_streaming modes
test/perf/perf_sstable: use switch-case when appropriate
instead of evaluating the constants in-class, accessing them via
a cached class property.
it would be handy if we could source `scylla-gdb.py` in `.gdbinit`,
but this script accesses some symbols which are not available without
a file being debugged. what's why gdb fails to load the init script:
```
Traceback (most recent call last):
File "/home/kefu/dev/scylladb/scylla-gdb.py", line 167, in <module>
class intrusive_slist:
File "/home/kefu/dev/scylladb/scylla-gdb.py", line 168, in intrusive_slist
size_t = gdb.lookup_type('size_t')
^^^^^^^^^^^^^^^^^^^^^^^^^
gdb.error: No type named size_t.
```
so we have to `file path/to/scylla` and *then*
`source scylla-gdb.py` every time when we debug scylla or a seastar
application, instead of loading `scylla-gdb.py` in `.gdbinit`.
the reason is that the script accesses the debug symbols like
`gdb.lookup_type('size_t')` in-class. so when the python interpreter
reads the script, it evaluates this statement, but at that moment,
the debug symbols are not loaded, so `source scylla-gdb.py` fails
in `.gdbinit`.
in this change, we transform all these class variables to cached
properties, so that they
* are evaluated on-demand
* are evaluated only once at most
this addresses the pain at the expense of verbosity.
---
this change intends to improve the developer's user experience, and has no impacts on product, so no need to backport.
Closesscylladb/scylladb#20334
* github.com:scylladb/scylladb:
test/scylla_gdb: test the .gdb init use case
scylla-gdb.py: lazy-evaluate the constants
All users of it have sstable_test_env at hand (in fact -- they call env
method to get table_for_test). And since sstable_test_env already has a
bunch of methods to create sstable, the table_for_test wrapper doesn't
need to duplicate this code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20360
so that we can set this the parameter passed to `-inline-threshold` with
`configure.py` when building with CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20364
before this change, if user does not have `/bin/sh` around, when
installing scylla packages, the script in `%pretrans" is executed,
and fails due to missing `/bin/sh`. per
https://docs.fedoraproject.org/en-US/packaging-guidelines/Scriptlets/#pretrans
> Note that the %pretrans scriptlet will, in the particular case of
> system installation, run before anything at all has been installed.
> This implies that it cannot have any dependencies at all. For this
> reason, %pretrans is best avoided, but if used it MUST (by necessity)
> be written in Lua. See
> https://rpm-software-management.github.io/rpm/manual/lua.html for more
> information.
but we were trying to warn users upgrading from scylla < 1.7.3, which
was released 7 years ago at the time of writing.
in this change, we drop the `%pretrans` section. hopefuly they will
find their way out if they still exist.
Fixesscylladb/scylladb#20321
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20365
before this change, when running `scylla-housekeeping`:
```
/opt/scylladb/scripts/libexec/scylla-housekeeping:122: SyntaxWarning: invalid escape sequence '\s'
match = re.search(".*http.?://repositories.*/scylladb/([^/\s]+)/.*/([^/\s]+)/scylladb-.*", line)
```
we could have the warning above. because `\s` is not a valid escape
sequence, but the Python interpreter accepts it as two separated
characters of `\s` after complaining. but it's still annoying.
so, let's use a raw string here.
Refs scylladb/scylladb#20317
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20359
before this change, when building the test of `view_build_test` with
clang-20, we can have following build failure:
```
FAILED: test/boost/CMakeFiles/view_build_test.dir/Debug/view_build_test.cc.o
/home/kefu/.local/bin/clang++ -DBOOST_ALL_DYN_LINK -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TESTING_MAIN -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT test/boost/CMakeFiles/view_build_test.dir/Debug/view_build_test.cc.o -MF test/boost/CMakeFiles/view_build_test.dir/Debug/view_build_test.cc.o.d -o test/boost/CMakeFiles/view_build_test.dir/Debug/view_build_test.cc.o -c /home/kefu/dev/scylladb/test/boost/view_build_test.cc
/home/kefu/dev/scylladb/test/boost/view_build_test.cc:998:5: error: unknown type name 'simple_schema'
998 | simple_schema ss;
| ^
```
apparently, `simple_schema`'s declaration is not available in this
translation unit.
in this change
* we include the header where `simple_schema` is defined, so that
the build passes with clang-20.
* also take this opportunity to reorder the header a little bit,
so the testing headers are grouped together.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20367
in a recent seastar change (644bb662), we do not include
`seastar/testing/random.hh` in `seastar/testing/test_runner.hh` anymore,
as the latter is not a facade of the former, and neither does it use the
former. as a sequence, some tests which take the advantage of the
included `seastar/testing/random.hh` do not build with the latest
seastar:
```
FAILED: test/lib/CMakeFiles/test-lib.dir/key_utils.cc.o
/usr/bin/clang++ -DBOOST_REGEX_DYN_LINK -DBOOST_REGEX_NO_LIB -DBOOST_UNIT_TEST_FRAMEWORK_DYN_LINK -DBOOST_UNIT_TEST_FRAMEWORK_NO_LIB -DDEVEL -DFMT_SHARED -DSCYLLA_BUILD_MODE=dev -DSCYLLA_ENABLE_ERROR_INJECTION -DSCYLLA_ENABLE_PREEMPTION_SOURCE -DSEASTAR_API_LEVEL=7 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/__w/scylladb/scylladb -I/__w/scylladb/scylladb/build/gen -I/__w/scylladb/scylladb/seastar/include -I/__w/scylladb/scylladb/build/seastar/gen/include -I/__w/scylladb/scylladb/build/seastar/gen/src -I/__w/scylladb/scylladb/build -isystem /__w/scylladb/scylladb/abseil -isystem /__w/scylladb/scylladb/build/rust -O2 -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/__w/scylladb/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -MD -MT test/lib/CMakeFiles/test-lib.dir/key_utils.cc.o -MF test/lib/CMakeFiles/test-lib.dir/key_utils.cc.o.d -o test/lib/CMakeFiles/test-lib.dir/key_utils.cc.o -c /__w/scylladb/scylladb/test/lib/key_utils.cc
In file included from /__w/scylladb/scylladb/test/lib/key_utils.cc:11:
/__w/scylladb/scylladb/test/lib/random_utils.hh:25:30: error: no member named 'local_random_engine' in namespace 'seastar::testing'
25 | return seastar::testing::local_random_engine;
| ~~~~~~~~~~~~~~~~~~^
1 error generated.
```
in this change, we include `seastar/testing/random.hh` when the random
facility is used, so that they can be compiled with the latest seastar
library.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20368
under the hood, std::map::count() and std::map::contains() are nearly
identical. both operations search for the given key witin the map.
however, the former finds a equal range with the given
key, and gets the distance between the disntance between the begin
and the end of the range; while the later just searches with the given
key.
since scylla-nodetool is not a performance-critical application, the
minor difference in efficiency between these two operations is unlikely
to have a significant impact on its overall performance.
while std::map::count() is generally suitable for our need, it might be
beneficial to use a more appropriate API.
in this change, we use std::map::contains() in the place of
std::map::count() when checking for the existence of a paramter with
given name.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20350
for better readability
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#20366
* github.com:scylladb/scylladb:
compaction: use std::views::reverse when appropriate
compaction: use structured binding when appropriate
Bind variables in CQL have two formats: positional (`?`) where a
variable is referred to by its relative position in the statement,
and named (`:var`), where the user is expected to supply a
name->value mapping.
In 19a6e69001 we identified the case where a named bind variable
appears twice in a query, and collapsed it to a single entry in the
statement metadata. Without this, a driver using the named variable
syntax cannot disambiguate which variable is referred to.
However, it turns out that users can use the positional call form
even with the named variable syntax, by using the positional
API of the driver. To support this use case, we add a configuration
variable to disable the same-variable detection.
Because the detection has to happen when the entire statement is
visible, we have to supply the configuration to the parser. We
call it the `dialect` and pass it from all callers. The alternative
would be to add a pre-prepare call similar to fill_prepare_context that
rewrites all expressions in a statement to deduplicate variables.
A unit test is added.
Fixes#15559
All seeds hostname resolution errors will be ignored during a node
restart in case the node had already joined a cluster. This will
prevent restart errors if some seed names are not resolvable.
Fixesscylladb/scylladb#14945Closesscylladb/scylladb#20292
* github.com:scylladb/scylladb:
Ignore seed name resolution errors on restart.
Add a test for starting with a wrong seed.
Currently if a coordinator and a node being replaced are in the same DC
while inter-dc encryption is enabled (connections between nodes in the
same DC should not be encrypted) the replace operation will fail. It
fails because a coordinator uses non encrypted connection to push raft
data to the new node, but the new node will not accept such connection
until it knows which DC the coordinator belongs to and for that the raft
data needs to be transferred.
The series adds the test for this scenario and the fix for the
chicken&egg problem above.
The series (or at least the fix itself) needs to be backported because
this is a serious regression.
Fixes: scylladb/scylladb#19025Closesscylladb/scylladb#20290
* github.com:scylladb/scylladb:
topology coordinator: fix indentation after the last patch
topology coordinator: do not add replacing node without a ring to topology
test: add test for replace in clusters with encryption enabled
test.py: add server encryption support to cluster manager
.gitignore: fix pattern for resources to match only one specific directory
before this change, we run all the tests in a single pytest session,
with scylladb debug symbols loaded. but we want to test another use
case, where the scylladb debug symbols are missing.
in this change,
* we do not check for the existence of debug symbols until necessary
* add a mark named "without_scylla"
* run the tests in two pytest sessions
- one with "without_scylla" mark
- one with "not without_scylla" mark
* add a test which is marked with the "without_scylla" mark. the test
verify that the scylla-gdb.py script can be loaded even without
scylladb debug symbols.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of evaluating the constants in-class, accessing them via
a cached class property.
it would be handy if we could source `scylla-gdb.py` in `.gdbinit`,
but this script accesses some symbols which are not available with
a file being debugged. so when gdb fails to load init script:
```
Traceback (most recent call last):
File "/home/kefu/dev/scylladb/scylla-gdb.py", line 167, in <module>
class intrusive_slist:
File "/home/kefu/dev/scylladb/scylla-gdb.py", line 168, in intrusive_slist
size_t = gdb.lookup_type('size_t')
^^^^^^^^^^^^^^^^^^^^^^^^^
gdb.error: No type named size_t.
```
so we have to `file path/to/scylla` and *then*
`source scylla-gdb.py` every time when we debug scylla or a seastar
application, instead of loading `scylla-gdb.py` in `.gdbinit`.
the reason is that the script access the debug symbols like
`gdb.lookup_type('size_t')` in-class. so when the python interpreter
reads the script, it evaluates this statement, but at that moment,
the debug symbols are not loaded, so `source scylla-gdb.py` fails
in `.gdbinit`.
in this change, we transform all these class variables to cached
property, so that they
* are evaluated on-demand
* are evaluated only once at most
this addresses the pain at the expense of verbosity.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
repair_service::repair_flush_hints_batchlog_handler may access batchlog
manager while it is uninitialized.
Throw if batchlog manager isn't initialized.
Fixes: #20236.
Needs backport to 6.0 and 6.1 as they suffer from the uninitialized bm access.
Closesscylladb/scylladb#20251
* github.com:scylladb/scylladb:
test: add test to ensure repair won't fail with uninitialized bm
repair: throw if batchlog manager isn't initialized
This commit replaces the 6.0-to-6.1 upgrade guide with the 6.1-to-6.2 upgrade guide.
The new guide is a template that covers the basic procedure.
If any 6.2-specific updates are required, they will have to be added along with development.
Closesscylladb/scylladb#20178
In scylladb/scylladb@7301a96, in the function `hint_endpoint_manager::store_hint()`,
we transformed the lambda passed to `seastar::with_gate()` to a coroutine lambda
to improve the readability. However, there was a subtle problem related to
lifetimes of the captures that needed to be addressed:
* Since we started `co_await`ing in the lambda, the captures were at risk of
being destructed too soon. The usual solution is to wrap a coroutine lambda
within a `seastar::coroutine::lambda` object and rely on the extended lifetime
enforced by the semantics of the language.
See `docs/dev/lambda-coroutine-fiasco.md` for more context.
* However, since we don't immediately `co_await` the future returned by
`with_gate()`, we cannot rely on the extended lifetime provided by the wrapper.
The document linked in the previous bullet point suggests keeping the passed
coroutine lambda as a variable and pass it as a reference to `with_gate()`.
However, that's not feasible either because we discard the returned future and
the function returns almost instantly -- destructing every local object, which
would encompass the lambda too.
The solution used in the commit was to move captures of the lambda into
the lambda's body. That helped because Seastar's backend is responsible for
keeping all of the local variables alive until the lambda finishes its execution.
However, we didn't move all of the captures into the lambda -- the missing one
was the `this` pointer that was implicitly used in the lambda.
Address sanitiser hasn't reported any bugs related to the pointer yet, but
the bug is most likely there.
In this commit, we transform the lambda's body into a new member function
and only call it from the lambda. This way, we don't need to care about
the lifetimes of the captures because Seastar ensures that the function's
arguments stay alive until the coroutine finishes.
Choosing this solution instead of assigning `this` to a pointer variable
inside the lambda's body and using it to refer to the object's members
has actual benefit: it's not possible to accidentally forget to refer
to a member of the object via the pointer; it also makes the code less
awkward.
Fixesscylladb/scylladb#20306Closesscylladb/scylladb#20258
* github.com:scylladb/scylladb:
db/hints: Fix indentation in `do_store_hint()`
db/hints: Move code for writing hints to separate function
Add nodetool commands to manage task manager tasks:
- tasks abort - aborts the task
- tasks list - lists all tasks in the module
- tasks modules - lists all modules
- tasks set-ttl - sets task ttl
- tasks status - gets status of the task
- tasks tree - gets statuses of the task and all its desendent's
- tasks ttl - gets task ttl
- tasks wait - waits for the task and gets its status
Fixes: https://github.com/scylladb/scylladb/issues/19201.
Closesscylladb/scylladb#19614
* github.com:scylladb/scylladb:
test: nodetool: add tests for tasks commands
nodetool: tasks: add nodetool commands to track task manager tasks
api: task_manager: return status 403 if a task is not abortable
api: task_manager: return none instead of empty task id
api: task_manager: add timeout to wait_task
api: task_manager: add operation to get ttl
nodetool: add suboperations support
nodetool: change operations_with_func type
nodetool: prepare operation related classes for suboperations
A dialect is a different way to interpret the same CQL statement.
Examples:
- how duplicate bind variable names are handled (later in this series)
- whether `column = NULL` in LWT can return true (as is now) or
whether it always returns NULL (as in SQL)
Currently, dialect is an empty structure and will be filled in later.
It is passed to query_processor methods that also accept a CQL string,
and from there to the parser. It is part of the prepared statement cache
key, so that if the dialect is changed online, previous parses of the
statement are ignored and the statement is prepared again.
The patch is careful to pick up the dialect at the entry point (e.g.
CQL protocol server) so that the dialect doesn't change while a statement
is parsed, prepared, and cached.
~~~
generic_server: convert connection tracking to seastar::gate
If we call server::stop() right after "server" construction, it hangs:
With the server never listening (never accepting connections and never
serving connections), nothing ever calls server::maybe_stop().
Consequently,
co_await _all_connections_stopped.get_future();
at the end of server::stop() deadlocks.
Such a server::stop() call does occur in controller::do_start_server()
[transport/controller.cc], when
- cserver->start() (sharded<cql_server>::start()) constructs a
"server"-derived object,
- start_listening_on_tcp_sockets() throws an exception before reaching
listen_on_all_shards() (for example because it fails to set up client
encryption -- certificate file is inaccessible etc.),
- the "deferred_action"
cserver->stop().get();
is invoked during cleanup.
(The cserver->stop() call exposing the connection tracking problem dates
back to commit ae4d5a60ca ("transport::controller: Shut down distributed
object on startup exception", 2020-11-25), and it's been triggerable
through the above code path since commit 6b178f9a4a
("transport/controller: split configuring sockets into separate
functions", 2024-02-05).)
Tracking live connections and connection acceptances seems like a good fit
for "seastar::gate", so rewrite the tracking with that. "seastar::gate"
can be closed (and the returned future can be waited for) without anyone
ever having entered the gate.
NOTE: this change makes it quite clear that neither server::stop() nor
server::shutdown() must be called multiple times. The permitted sequences
are:
- server::shutdown() + server::stop()
- or just server::stop().
Fixes#10305
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
~~~
Fixes#10305.
I think we might want to backport this -- it fixes a hang-on-misconfiguration which affects `scylla-6.1.0-0.20240804.abbf0b24a60c.x86_64` minimally. Basically every release that contains commit ae4d5a60ca has a theoretical chance for the hang, and every release that contains commit 6b178f9a4a has a practical chance for the hang.
Focusing on the more practical symptom (i.e., releases containing commit 6b178f9a4a), `git tag --contains 6b178f9a4a90` gives us (ignoring candidates and release candidates):
- scylla-6.0.0
- scylla-6.0.1
- scylla-6.0.2
- scylla-6.1.0
Closesscylladb/scylladb#20212
* github.com:scylladb/scylladb:
generic_server: make server::stop() idempotent
generic_server: coroutinize server::shutdown()
generic_server: make server::shutdown() idempotent
test/generic_server: add test case
configure, cmake: sort the lists of boost unit tests
generic_server: convert connection tracking to seastar::gate
If there's a token metadata for a given table, and it is in split mode,
it will be registered such that split monitor can look at it, for
example, to start split work, or do nothing if table completed it.
during topology change, e.g. drain, split is stalled since it cannot
take over the state machine.
It was noticed that the log is being spammed with a message saying the
table completed split work, since every tablet metadata update, means
waking up the monitor on behalf of a table. So it makes sense to
demote the logging level to debug. That persists until drain completes
and split can finally complete.
Another thing that was noticed is that during drain, a table can be
submitted for processing faster than the monitor can handle, so the
candidate queue may end up with multiple duplicated entries for same
table, which means unnecessary work. That is fixed by using a
sequenced set, which keeps the current FIFO behavior.
Fixes#20339.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#20029
Handed over from https://github.com/scylladb/scylladb/pull/20149
This adds minimal implementation of the start-restore API call.
The method starts a task that runs load-and-stream functionality against sstables from S3 bucket. Arguments are:
```
endpoint -- the ID in object_store.yaml config file
bucket -- the target bucket to get objects from
keyspace -- the keyspace to work on
table -- the table to work on
snapshot -- the name of the snapshot from which the backup was taken
```
The task runs in the background, its task_id is returned from the method once it's spawned and it should be used via /task_manager API to track the task execution and completion.
Remote sstables components are scanned as if they were placed in local upload/ directory. Then colelcted sstables are fed into load-and-stream.
This branch has https://github.com/scylladb/scylladb/pull/19890 (Integrated backup), https://github.com/scylladb/scylladb/pull/20120 (S3 lister) and few more minor PRs merged in. The restore branch itself starts with [utils: Introduce abstract (directory) lister](29c867b54d) commit.
refs: https://github.com/scylladb/scylladb/issues/18392Closesscylladb/scylladb#20305
* github.com:scylladb/scylladb:
tools/scylla-nodetool: add restore integration
test/object_store: Add simple restore test
test/object_store: Generalize prepare_snapshot_for_backup()
code: Introduce restore API method
sstable_loader: Add sstables::storage_manager dependency
sstable_loader: Maintain task manager module
sstable_loader: Out-line constructor
distributed_loader: Split get_sstables_from_upload_dir()
sstables/storage: Compose uploaded sstable path simpler
sstable_directory: Prepare FS lister to scan files on S3
sstable_directory: Parse sstable component without full path
s3-client: Add support for lister::filter
utils: Introduce abstract (directory) lister
We revive the `join_ring` option. We support it only in the
Raft-based topology, as we plan to remove the gossip-based topology
when we fix the last blocker - the implementation of the manual
recovery tool. In the Raft-based topology, a node can be assigned
tokens only once when it joins the cluster. Hence, we disallow
joining the ring later, which is possible in Cassandra.
The main idea behind the solution is simple. We make the unsupported
special case of zero tokens a supported normal case. Nodes with zero
tokens assigned are called "zero-token nodes" from now on.
From the topology point of view, zero-token nodes are the same as
token-owning nodes. They can be in the same states, etc. From the
data point of view, they are different. They are not members of
the token ring, so they are not present in
`token_metadata::_normal_token_owners`. Hence, they are ignored in
all non-local replication strategies. The tablet load balancer also
ignores them.
Zero-token nodes can be used as coordinator-only nodes, just like in
Cassandra. They can handle requests just like token-owning nodes.
The main motivation behind zero-token nodes is that they can prevent
the Raft majority loss efficiently. Zero-token nodes are group 0
voters, but they can run on much weaker and cheaper machines because
they do not replicate data and handle client requests by default
(drivers ignore them). For example, if there are two DCs, one with 4
nodes and one with 5 nodes, if we add a DC with 2 zero-token nodes,
every DC will contain less than half of the nodes, so we won't lose
the majority when any DC dies.
Another way of preventing the Raft majority loss is changing the
voter set, which is tracked by scylladb/scylladb#18793. That approach
can be used together with zero-token nodes. In the example above, if
we choose equal numbers of voters in both DCs, then a DC with one
zero-token node will be sufficient. However, in the typical setup of
2 DCs with the same number of nodes it is enough to add a DC with
only one zero-token node without changing the voter set.
Zero-token nodes could also be used as load balancers in the
Alternator.
Additionally, this PR fixesscylladb/scylladb#11087, which turned out to
be a blocker.
This PR introduced a new feature. There is no need to backport it.
Fixesscylladb/scylladb#6527Fixesscylladb/scylladb#11087Fixesscylladb/scylladb#15360Closesscylladb/scylladb#19684
* github.com:scylladb/scylladb:
docs: raft: document using zero-token nodes to prevent majority loss
test: test recovery mode in the presence of zero-token nodes
test: topology: util.py: add cqls parameter to check_system_topology_and_cdc_generations_v3_consistency
test: topology: util.py: accept zero tokens in check_system_topology_and_cdc_generations_v3_consistency
treewide: support zero-token nodes in the recovery mode
storage_proxy: make TRUNCATE work locally for local tables
test: topology: util.py: document that check_token_ring_and_group0_consistency fails with zero-token nodes
test: test zero-token nodes
test: test_topology_ops: move helpers to topology/util.py
feature_service: introduce the ZERO_TOKEN_NODES feature
storage_service: rename join_token_ring to join_topology
storage_service: raft_topology_cmd_handler: improve warnings
topology_coordinator: fix indentation after the previous patch
treewide: introduce support for zero-token nodes in Raft topology
system_keyspace: load_topology_state: remove assertion impossible to hit
treewide: distinguish all nodes from all token owners
gossip topology: make a replacing node remove the replaced node from topology
locator: topology: add_or_update_endpoint: use none as the default node state
test: boost: tablets tests: ensure all nodes are normal token owners
token_metadata: rename get_all_endpoints and get_all_ips
network_topology_strategy: reallocate_tablets: remove unused dc_rack_nodes
virtual_tables: cluster_status_table: execute: set dc regardless of the token ownership
When only inter dc encryption is enabled a non encrypted connection
between two nodes is allowed only if both nodes are in the same dc.
If a nodes that initiates the connection knows that dst is in the same
dc and hence use non encrypted connection, but the dst not yet knows the
topology of the src such connection will not be allowed since dst cannot
guaranty that dst is in the same dc.
Currently, when topology coordinator is used, a replacing node will
appear in the coordinator's topology immediately after it is added to the
group0. The coordinator will try to send raft message to the new node
and (assuming only inter dc encryption is enabled and replacing node and
the coordinator are in the same dc) it will try to open regular, non encrypted,
connection to it. But the replacing node will not have the coordinator
in it's topology yet (it needs to sync the raft state for that). so it
will reject such connection.
To solve the problem the patch does not add a replacing node that was
just added to group0 to the topology. It will be added later, when
tokens will be assigned to it. At this point a replacing node will
already make sure that its topology state is up-to-date (since it will
execute a raft barrier in join_node_response_params handler) and it knows
coordinator's topology. This aligns replace behaviour with bootstrap
since bootstrap also does not add a node without a ring to the topology.
The patch effectively reverts b8ee8911caFixes: scylladb/scylladb#19025
In scylladb/scylladb@7301a96, in the function `hint_endpoint_manager::store_hint()`,
we transformed the lambda passed to `seastar::with_gate()` to a coroutine lambda
to improve the readability. However, there was a subtle problem related to
lifetimes of the captures that needed to be addressed:
* Since we started `co_await`ing in the lambda, the captures were at risk of
being destructed too soon. The usual solution is to wrap a coroutine lambda
within a `seastar::coroutine::lambda` object and rely on the extended lifetime
enforced by the semantics of the language.
See `docs/dev/lambda-coroutine-fiasco.md` for more context.
* However, since we don't immediately `co_await` the future returned by
`with_gate()`, we cannot rely on the extended lifetime provided by the wrapper.
The document linked in the previous bullet point suggests keeping the passed
coroutine lambda as a variable and pass it as a reference to `with_gate()`.
However, that's not feasible either because we discard the returned future and
the function returns almost instantly -- destructing every local object, which
would encompass the lambda too.
The solution used in the commit was to move captures of the lambda into
the lambda's body. That helped because Seastar's backend is responsible for
keeping all of the local variables alive until the lambda finishes its execution.
However, we didn't move all of the captures into the lambda -- the missing one
was the `this` pointer that was implicitly used in the lambda.
Address sanitiser hasn't reported any bugs related to the pointer yet, but
the bug is most likely there.
In this commit, we transform the lambda's body into a new member function
and only call it from the lambda. This way, we don't need to care about
the lifetimes of the captures because Seastar ensures that the function's
arguments stay alive until the coroutine finishes.
Choosing this solution instead of assigning `this` to a pointer variable
inside the lambda's body and using it to refer to the object's members
has actual benefit: it's not possible to accidentally forget to refer
to a member of the object via the pointer; it also makes the code less
awkward.
Modify operation and add operation_action class so that information
about suboperations is stored. It's a preparation for adding
suboperations support to nodetool.
before this change, we included `-ffile-prefix-map=${CMAKE_SOURCE_DIR}=.`
in cflags when building the tree with CMake, but this was wrong.
as the "." directory is the build directory used by CMake. and this
directory is specified by the `-B` option when generating the building
system. if `configure.py --use-cmake` is used to build the tree,
the build directory would be "build". so this option instructs the compiler
to replace the directory of source file in the debug symbols and in
`__FILE__` at compile time.
but, in a typical workspace, for instance, `build/main.cc` does not exist.
the reason why this does not apply to CMake but applies to the rules
generated by `configure.py` is that, `configure.py` puts the generated
`build.ninja` right under the top source directory, so `.` is correct and
it helps to create reproducible builds. because this practically erases
the path prefixes in the build output. while CMake puts it under the
specified build directory, replacing the source directory with the build
directory with the file prefix map is just wrong.
there are two options to address this problem:
* stop passing this option. but this would lead to non-reproducible
builds. as we would encode the build directory in the "scylla"
executable. if a developer needs to rebuild an executable for debugging
a coredump generated in production, he/she would have to either
build the tree in the same directory as our CI does. or, he/she
has to pass `-ffile-prefix-map=...` to map the local build directory
to the one used by CI. this is not convenient.
* instead of using `${CMAKE_SOURCE_DIR}=.`, add `${CMAKE_BINARY_DIR}=.`.
this erases the build directory in the outputs, but preserves the
debuggability.
so we pick the second solution.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20329
We modify existing tests to verify that the recovery mode works
correctly in the presence of zero-token nodes.
In `test_topology_recovery_basic`, we test the case when a
zero-token node is live. In particular, we test that the
gossip-based restart of such a node works.
In `test_topology_recovery_after_majority_loss`, we test the case
when zero-token nodes are unrecoverable. In particular, we test
that the gossip-based removenode of such nodes works.
Since zero-token nodes are ignored by the Python driver if it also
connects to other nodes, we use different CQL sessions for a
zero-token node in `test_topology_recovery_basic`.
In the following commit, we modify `test_topology_recovery_basic`
to test the recovery mode in the presence of live zero-token nodes.
Unfortunately, it requires a bit ugly workaround. Zero-token nodes
are ignored by the Python driver if it also connects to other
nodes because of empty tokens in the `system.peers` table. In that
test, we must connect to a zero-token node to enter the recovery
mode and purge the Raft data. Hence, we use different CQL sessions
for different nodes.
In the future, we may change the Python driver behavior and revert
this workaround. Moreover, the recovery tests will be removed or
significantly changed when we implement the manual recovery tool.
Therefore, we shouldn't worry about this workaround too much.
Before we use `check_system_topology_and_cdc_generations_v3_consistency`
in a test with a zero-token node, we must ensure it doesn't fail
because of zero tokens in a row of the `system.topology` table.
Before we implement the manual recovery tool, we must support
zero-token nodes in the recovery mode. This means that two topology
operations involving zero-token nodes must work in the gossip-based
topology:
- removing a dead zero-token node,
- restarting a live zero-token node.
We make changes necessary to make them work in this patch.
In on of the following patches, we implement support for zero-token
nodes in the recovery mode. To achieve this, we need to be able to
purge all Raft data on live zero-token nodes by using TRUNCATE.
Currently, TRUNCATE works the same for all replication strategies - it
is performed on all token owners. However, zero-token nodes are not
token owners, so TRUNCATE would ignore them. Since zero-token nodes
store only local tables, fixing scylladb/scylladb#11087 is the perfect
solution for the issue with zero-token nodes. We do it in this patch.
Fixesscylladb/scylladb#11087
We add tests to verify the basic properties of zero-token nodes.
`test_zero_token_nodes_no_replication` and
`test_not_enough_token_owners` are more or less deterministic tests.
Running them only in the dev mode is sufficient.
`test_zero_token_nodes_topology_ops` is quite slow, as expected,
considering parameterization and the number of topology operations.
In the future we can think of making it faster or skipping in the
debug mode. For now, our priority is to test zero-token nodes
thoroughly.
In one of the following patches, we reuse the helper functions from
`test_topology_ops` in a new test, so we move them to `util.py`.
Also, we add the `cl` parameter to `start_writes`, as the new test
will use `cl=2`.
Zero-token nodes must be supported by all nodes in the cluster.
Otherwise, the non-supporting nodes would crash on some assertion
that assumes only token-owing normal nodes make sense.
Hence, we introduce the ZERO_TOKEN_NODES cluster feature. Zero-token
nodes refuse to boot if it is not supported.
I tested this patch manually. First, I booted a node built in the
previous patch. Then, I tried to add a zero-token node built in this
patch. It refused to boot as expected.
We revive the `join_ring` option. We support it only in the
Raft-based topology, as we plan to remove the gossip-based topology
when we fix the last blocker - the implementation of the manual
recovery tool. In the Raft-based topology, a node can be assigned
tokens only once when it joins the cluster. Hence, we disallow
joining the ring later, which is possible in Cassandra.
The main idea behind the solution is simple. We make the unsupported
special case of zero tokens a supported normal case. Nodes with zero
tokens assigned are called "zero-token nodes" from now on.
From the topology point of view, zero-token nodes are the same as
token-owning nodes. They can be in the same states, etc. From the
data point of view, they are different. They are not members of
the token ring, so they are not present in
`token_metadata::_normal_token_owners`. Hence, they are ignored in
all non-local replication strategies. The tablet load balancer also
ignores them.
Topology operations involving zero-token nodes are simplified:
- `add` and `replace` finish in the `join_group0` state, so creating
a new CDC generation and streaming are skipped,
- `removenode` and `decommission` skip streaming,
- `rebuild` does not even contact the topology coordinator as there
is nothing to rebuild,
Also, if the topology operation involves a token-owning node,
zero-token nodes are ignored in streaming.
Zero-token nodes can be used as coordinator-only nodes, just like in
Cassandra. They can handle requests just like token-owning nodes.
The main motivation behind zero-token nodes is that they can prevent
the Raft majority loss efficiently. Zero-token nodes are group 0
voters, but they can run on much weaker and cheaper machines because
they do not replicate data and handle client requests by default
(drivers ignore them). For example, if there are two DCs, one with 4
nodes and one with 5 nodes, if we add a DC with 2 zero-token nodes,
every DC will contain less than half of the nodes, so we won't lose
the majority when any DC dies.
Another way of preventing the Raft majority loss is changing the
voter set, which is tracked by scylladb/scylladb#18793. That approach
can be used together with zero-token nodes. In the example above, if
we choose equal numbers of voters in both DCs, then a DC with one
zero-token node will be sufficient. However, in the typical setup of
2 DCs with the same number of nodes it is enough to add a DC with
only one zero-token node without changing the voter set.
Zero-token nodes could also be used as load balancers in the
Alternator.
In one of the following patches, we introduce support for zero-token
nodes. From that point, getting all nodes and getting all token
owners isn't equivalent. In this patch, we ensure that we consider
only token owners when we want to consider only token owners (for
example, in the replication logic), and we consider all nodes when
we want to consider all nodes (for example, in the topology logic).
The main purpose of this patch is to make the PR introducing
zero-token nodes easier to review. The patch that introduces
zero-token nodes is already complicated. We don't want trivial
changes from this patch to make noise there.
This patch introduces changes needed for zero-token nodes only in the
Raft-based topology and in the recovery mode. Zero-token nodes are
unsupported in the gossip-based topology outside recovery.
Some functions added to `token_metadata` and `topology` are
inefficient because they compute a new data structure in every call.
They are never called in the hot path, so it's not a serious problem.
Nevertheless, we should improve it somehow. Note that it's not
obvious how to do it because we don't want to make `token_metadata`
store topology-related data. Similarly, we don't want to make
`topology` store token-related data. We can think of an improvement
in a follow-up.
We don't remove unused `topology::get_datacenter_rack_nodes` and
`topology::get_datacenter_nodes`. These function can be useful in the
future. Also, `topology::_dc_nodes` is used internally in `topology`.
In the following patch, we change the gossiper to work the same for
zero-token nodes and token-owning nodes. We replace occurrences of
`is_normal_token_owner` with topology-based conditions. We want to
rely on the invariant that token-owning nodes own tokens if and only
if they are in the normal or leaving state. However, this invariant
is broken by a replacing node because it does not remove the
replaced node from topology. Hence, after joining, the replacing node
has topology with a node that is not a token owner anymore but is
in a leaving state (`being_replaced`). We fix it to prevent the
following patch from introducing a regression.
In one of the following patches, we change the gossiper to work the
same for zero-token nodes and token-owning nodes. We replace
occurrences of `is_normal_token_owner` with topology-based
conditions. We want to rely on the invariant that token-owning nodes
own tokens if and only if they are in the normal or leaving state.
However, this invariant can be broken in the gossip-based topology
when a new node joins the cluster. When a boostrapping node starts
gossiping, other nodes add it to their topology in
`storage_service::on_alive`. Surprisingly, the state of the new node
is set to `normal`, as it's the default value used by
`add_or_update_endpoint`. Later, the state will be set to
`bootstrapping` or `replacing`, and finally it will be set again to
`normal` when the join operation finishes. We fix this strange
behavior by setting the node state to `none` in
`storage_service::on_alive` for nodes not present in the topology.
Note that we must add such nodes to the topology. Other code needs
their Host ID, IP, and location.
We change the default node state from `normal` to `none` in
`add_or_update_endpoint` to prevent bugs like the one in
`storage_service::on_alive`. Also, we ensure that nodes in the `none`
state are ignored in the getters of `locator::topology`.
In one of the following patches, we make NetworkTopologyStrategy
and the tablet load balancer consider only normal token owners to
ensure they ignore zero-token nodes. Some unit tests would start
failing after this change because they do not ensure that all
nodes are normal token owners. This patch prevents it.
Judging by the logic in the test cases in
`network_topology_strategy_test`, `point++` was probably intended
anyway.
In one of the following patches, we introduce support for zero-token
nodes. A zero-token node that has successfully joined the cluster is
in the normal state but is not a normal token owner. Hence, the names
of `get_all_endpoints` and `get_all_ips` become misleading. They
should specify that the functions return only IDs/IPs of token owners.
before this change, memtable serves as the fixture for 6 test cases,
actually these 6 test cases can be categorized into a matrix of 3 x 2:
{ single_row, multi_row, large_partition } x { single_partition, multi_paritition }.
in this change, we break memtable into 3 different fixtures, to reflect
this fact. more readable this way. and a benefit is that each test does
not have to pay for the overhead of setup it does not use at all.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20177
as part of the efforts to address scylladb/scylladb#2717, we are
switching over to the CMake-based building system, and fade out the
mechinary to create the rules manually in `configure.py`.
in this change, we add `--no-use-cmake` to `configure.py`, it serves
two purposes:
* prepare for the change which enables cmake by default, by then,
we would set the default value of `use_cmake` to True, and allow
user to keep using the existing mechinary in the transition period
using `--no-use-cmake`.
* allows the CI to tell if a tree is able to build with CMake.
the command line option of `--use-cmake` is also used by the CI
workflows, and is passed to `configure.py` if `BUILD_WITH_CMAKE`
jenkins pipeline parameter is set. but not all branches with
`--use-cmake` are ready to build with CMake -- only the latest
master HEAD is ready. so the CI needs to check the capability of
building with CMake by looking at the output of `configure.py --help`,
to see if it includes `--no-use-cmake`.
after this change lands. we will remove the `BUILD_WITH_CMAKE`
parameter, and use cmake as long as `configure.py` supports
`--no-use-cmake` option.
the existing mechinary will stay with us for a short transition
period so that developers can take time to get used to the
usage of the naming of targets and the new directory arrangement.
as a side effect, #20079 will be fixed after switching to CMake.
---
this is a cmake-related change, hence no need to backport.
Closesscylladb/scylladb#20261
* github.com:scylladb/scylladb:
build: add --no-use-cmake option to configure.py
build: let configure.py fail if unknown option is passed to it
The test test_streams.py::test_stream_list_tables reproduces a bug where
enabling streams added a spurious result to ListTables. A reviewer of
that patch asked to also add a check that name of the table itself
doesn't disappear from ListTables when a stream is enabled, so this is
what this patch adds.
This theoretical scenario (a table's name disappearing from ListTables)
never happened, so the new check doesn't reproduce any known bug, but
I guess it never hurts to make the test stronger for regression testing.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19934
The code tries to build as "neighbors" an unordered_map from an iterator
of std::tuple, instead of the correct std::pair. Apparently, the tuples
are transparently converted to pairs on the newest compilers and the
whole works, but on slightly older compilers (like the one on Fedora 39)
Scylla no longer compiles - the compiler complains it can't convert a
tuple to a pair in this context.
So fix the code to use pairs, not tuples, and it fixes the build on
Fedora 39.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20319
as we have an API for restore a keyspace / table, let's expose this feature
with nodetool. so we can exercise it without the help of scylla-manager
or 3rd-party tools with a user-friendly interface.
in this change:
* add a new subcommand named "restore" to nodetool
* add test to verify its interaction with the API server
* update the document accordingly.
* the bash completion script is updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The test shows how to restore previously backed up table:
- backup
- truncate to get rid of existing sstables
- start restore with the new API method
- wait for the task to finish
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method starts a task that uses sstables_loader load-and-stream
functionality to bring new sstables into the cluster. The existing
load-and-stream picks up sstables from upload/ directory, the newly
introduced task collects them from S3 bucket and given prefix (that
correspond to the path where backup API method put them).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Gossiper seeds host name resolution failures are ignored during restart if
a node is already boostrapped (i.e. it has successfully joined the cluster).
Fixesscylladb/scylladb#14945
Check whether we can stop a generic server without first asking it to
listen.
The test fails currently; the failure mode is a hang, which triggers the 5
minute timeout set in the test:
> unknown location(0): fatal error: in "stop_without_listening":
> seastar::timed_out_error: timedout
> seastar/src/testing/seastar_test.cc(43): last checkpoint
> test/boost/generic_server_test.cc(34): Leaving test case
> "stop_without_listening"; testing time: 300097447us
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Both lists were obviously meant to be sorted originally, but by today
we've introduced many instances of disorder -- thus, inserting a new test
in the proper place leaves the developer scratching their head. Sort both
lists.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
If we call server::stop() right after "server" construction, it hangs:
With the server never listening (never accepting connections and never
serving connections), nothing ever calls server::maybe_stop().
Consequently,
co_await _all_connections_stopped.get_future();
at the end of server::stop() deadlocks.
Such a server::stop() call does occur in controller::do_start_server()
[transport/controller.cc], when
- cserver->start() (sharded<cql_server>::start()) constructs a
"server"-derived object,
- start_listening_on_tcp_sockets() throws an exception before reaching
listen_on_all_shards() (for example because it fails to set up client
encryption -- certificate file is inaccessible etc.),
- the "deferred_action"
cserver->stop().get();
is invoked during cleanup.
(The cserver->stop() call exposing the connection tracking problem dates
back to commit ae4d5a60ca ("transport::controller: Shut down distributed
object on startup exception", 2020-11-25), and it's been triggerable
through the above code path since commit 6b178f9a4a
("transport/controller: split configuring sockets into separate
functions", 2024-02-05).)
Tracking live connections and connection acceptances seems like a good fit
for "seastar::gate", so rewrite the tracking with that. "seastar::gate"
can be closed (and the returned future can be waited for) without anyone
ever having entered the gate.
NOTE: this change makes it quite clear that neither server::stop() nor
server::shutdown() must be called multiple times. The permitted sequences
are:
- server::shutdown() + server::stop()
- or just server::stop().
Fixes#10305
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
as part of the efforts to address scylladb/scylladb#2717, we are
switching over to the CMake-based building system, and fade out the
mechinary to create the rules manually in `configure.py`.
in this change, we add `--no-use-cmake` to `configure.py`, it serves
two purposes:
* prepare for the change which enables cmake by default, by then,
we would set the default value of `use_cmake` to True, and allow
user to keep using the existing mechinary in the transition period
using `--no-use-cmake`.
* allows the CI to tell if a tree is able to build with CMake.
the command line option of `--use-cmake` is also used by the CI
workflows, and is passed to `configure.py` if `BUILD_WITH_CMAKE`
jenkins pipeline parameter is set. but not all branches with
`--use-cmake` are ready to build with CMake -- only the latest
master HEAD is ready. so the CI needs to check the capability of
building with CMake by looking at the output of `configure.py --help`,
to see if it includes --no-use-cmake`.
after this change lands. we will remove the `BUILD_WITH_CMAKE`
parameter, and use cmake as long as `configure.py` supports
`--no-use-cmake` option.
the existing mechinary will stay with us for a short transition
period so that developers can take time to get used to the
usage of the naming of targets and the new directory arrangement.
as a side effect, #20079 will be fixed after switching to CMake.
Refs scylladb/scylladb#2717
Refs scylladb/scylladb#20079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this allows us to use `configure.py` to tell if a certain argument is supported
without parsing its output. in the next commit, we will add `--no-use-cmake` option,
which will be used to tell if the tree is ready for using CMake for its building
system.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in `configure.py`, a set of options are specified when configuring
seastar, but not all of them were ported to scylla's CMake building
system.
for instance, `configure.py` explicitly disables io_uring reactor
backend at build time, but the CMake-based system does not.
so, in this change, in order to preserve the existing behavior, let's
port the two previously missing option to CMake-based building system
as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20288
Attempting to read a partition via `SELECT * FROM MUTATION_FRAGMENTS()`, which the node doesn't own, from a table using tablets causes a crash.
This is because when using tablets, the replica side simply doesn't handle requests for un-owned tokens and this triggers a crash.
We should probably improve how this is handled (an exception is better than a crash), but this is outside the scope of this PR.
This PR fixes this and also adds a reproducer test.
Fixes: https://github.com/scylladb/scylladb/issues/18786
Fixes a regression introduced in 6.0, so needs backport to 6.0 and 6.1
Closesscylladb/scylladb#20109
* github.com:scylladb/scylladb:
test/tablets: Test that reading tablets' mutations from MUTATION_FRAGMENTS works
replica/mutation_dump: enfore pinning of effective replication map
replica/mutation_dump: handle un-owned tokens (with tablets)
When executing reversed queries, a native revered format shall be used. Therefore, the table schema and the clustering key bounds are reversed before a partition slice and a read command are constructed.
It is, however, possible to run a reversed query passing a table schema but only when there are no restrictions on the clustering keys. In this particular situation, the query returns correct results. Since the current alternator tests in test.py do not imply any restrictions, this situation was not caught during development of https://github.com/scylladb/scylladb/pull/18864.
Hence, additional tests are provided that add clustering keys restrictions when executing reversed queries to capture such errors earlier than in dtests.
Additional manual tests were performed to test a mixed-node cluster (with alternator API enabled in Scylla on each node):
1. 2-node cluster with one node upgraded: reverse read queries performed on an old node
2. 2-node cluster with one node upgraded: reverse read queries performed on a new node
3. 2-node cluster with one node upgraded and all its sstable files deleted to trigger repair: reverse read queries performed on an old node
4. 2-node cluster with one node upgraded and all its sstable files deleted to trigger repair: reverse read queries performed on a new node
All reverse read queries above consists of:
- single-partition reverse reads with no clustering key restrictions, with single column restrictions and multi column restrictions both with and without paging turned on
The exact same tests were also performed on a fully upgraded cluster.
Fixes https://github.com/scylladb/scylladb/issues/20191
No backport is required as this is a complementary patch for the series https://github.com/scylladb/scylladb/pull/18864 that did not require backporting.
Closesscylladb/scylladb#20205
* github.com:scylladb/scylladb:
test_query.py: Test reverse queries with clustering key bounds
alternator::do_query Add additional trace log
alternator::do_query: Use native reversed format
alternator::do_query Rename schema with table_schema
Currently, the `force` property of the `source_dc` rebuild option
is lost and `raft_topology_cmd_handler` has no way to know
if it was given or not.
This in turn can cause rebuild to fail, even when `--force`
is set by the user, where it would succeed with gossip
topology changes, based on the source_dc --force semantics.
Fixesscylladb/scylladb#20242
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#20249
* seastar a7d81328...83e6cdfd (29):
> fair_queue: Export the number of times class was activated
> tests/unit: drop support of C++17
> remove vestigial OSv support
> cmake: undefine _FORTIFY_SOURCE on thread.cc
> container_perf: a benchmark for container perf
> io_sink: use chunked_fifo as _pending_io container
> chunked_fifo: implement clear in terms of pop_n
> chunked_fifo: pop_front_n
> io_sink: use iteration instead of indexing
> json2code_test: choose less popular port number
> ioinfo: add '--max-reqsize' parameter
> treewide: drop the support of fmtlib < 8.0.0
> build: bump up the required fmtlib version to 8.1.1
> conditional-variable: align when() and wait() behaviour in case of a predicate throwing an exception
> stall-analyser: add output support for flamegraph
> reactor: Add --io-completion-notify-ms option
> io_queue: Stall detector
> io_queue: Keep local variable with request execution delay
> io_queue: Rename flow ratio timer to be more generic
> reactor: Export _polls counter (internally)
> dns: de-inline dns_resolver::impl methods
> dns: enter seastar::net namespace
> dnf: drop compatibility for c-ares <= 1.16
> reactor: add missing includes of noncopyable_function.hh
> reactor: Reset one-shot signal to DFL before handling
> future: correctly document nested exception type emitted by finally()
> modules: fix FATAL_ERROR on compiler check
> seastar.cc: include fmt/ranges.h
> pack io_request
Closesscylladb/scylladb#20300
`dist-check` tests the generated rpm packages by installing them in a centos 7 container. but this script is terribly outdated
- centos 7 is deprecated. we should use a new distro's latest stable release.
- cqlsh was added to the family of rpms a while ago. we should test it as well.
- the directory hierarchy has been changed. we should read the artifacts from the new directories.
- cmake uses a different directory hierarchy. we should check the directory used by cmake as well.
to address these breaking changes, the scripts are updated accordingly.
---
this change gives an overhaul to a test, which is not used in production. so no need to backport.
Closesscylladb/scylladb#20267
* github.com:scylladb/scylladb:
tools/testing: add cqlsh rpm
tools/testing: adapt to cmake build directory
tools/testing: test with rockylinux:9 not centos:7
tools/testing: correct the paths to rpm packages and SCYLLA-*-FILE
dist-check: add :z option when mapping volume
The storage_manager maintains set of clients to configured object
storage(s). The sstables loader is going to spawn tasks that will talk
to to those storages, thus it needs the storage manager to get the
clients clients from.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This service is going to start tasks managed by task manager. For that,
it should have its module set up and registered.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will need this method to initialize sstable_directory
differently and then do its regular processing. For that, split the
method into two, next patch will re-use the common part it needs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Current S3 storage driver keeps sstables in bucket in a form of
/bucket/generation/component-name
To get sstables that are backed up on S3 this format doesn't apply,
because components are uploaded with their names unmodified. This patch
makes S3 storage driver account for that and not re-format component
paths for upload sstable state.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When component lister is created it checks the target storage options
for what kind of lister to create. For local options it creates FS
lister that collects sstables from their component files. For S3
options, it relies on sstables registry.
When collecting sstables from backup, it's not possible to use registry,
because those entries are not there. Instead, lister should pick up
individual components as it they were on local FS. This patch prepares
the lister for that -- in case S3 options are provided and the sstables'
state is "upload", don't try to read those from registry, but
instantiate the FS lister that will later use s3::bucket_lister.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When sstable directory collects a entry from storage, it tries to parse
its full path with the help of sstables::parse_path(). There are two
overloads of that function -- one with ks:cf arguments and one without.
The latter tries to "guess" keyspace and table names from the directory
name.
However, ks and table names are already known by the directory, it
doesn't even use the returned ks and cf values, so this parsing is
excessive. Also, future patches will put here backup paths, that might
not match the ks_name/table_name-table_uuid/ pattern that the parser
expects.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Directory lister comes with a filter function that tells lister which
entries to skip by its .get() method. For uniformity, add the same to
S3 bucket_lister.
After this change the lister reports shorter name in the returned
directory entry (with the prefix cut), so also need to tune up the unit
test respectively.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch hides directory_lister and bucket_lister behind a common
facade. The intention is to provide a uniform API for sstable_directory
that it could use to list sstables' components wherever they are.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, when a view update backlog of one replica is full, the write is still sent by the coordinator to all replicas. Because of the backlog, the write fails on the replica, causing inconsistency that needs to be fixed by repair. To avoid these inconsistencies, this patch adds a check on the coordinator for overloaded replicas. As a result, a write may be rejected before being sent to any replicas and later retried by the user, when the replica is no longer overloaded.
This patch does not remove the replica write failures, because we still may reach a full backlog when more view updates are generated after the coordinator check is performed and before the write reaches the replica.
Fixesscylladb/scylladb#17426Closesscylladb/scylladb#18334
* github.com:scylladb/scylladb:
mv: test the view update behavior
mv: add test for admission control
storage_proxy: return overloaded_exception instead of throwing
mv: reject user requests by coordinator when a replica is overloaded by MVs
repair_service::repair_flush_hints_batchlog_handler may access batchlog
manager while it is uninitialized.
Batchlog manager cannot be initialized before repair as we have the
dependencies chain:
repair_service -> storage_service::join_cluster -> batchlog_manager.
Throw if batchlog manager isn't initialized. That won't cause repair
to fail.
This series fixes an issue where histogram Summaries return an infinite value.
It updated the quantile calculation logic to address cases where values fall into the infinite bucket of a histogram.
Now, instead of returning infinite (max int), the calculation will return the last bucket limit, ensuring finite outputs in all cases.
The series adds a test for summaries with a specific test case for this scenario.
Fixes#20255
Need backport to 6.0, 6.1 and 2023.1 and above
Closesscylladb/scylladb#20257
* github.com:scylladb/scylladb:
test/estimated_histogram_test Add summary tests
utils/histogram.hh: Make summary support inifinite bucket.
instead of using the default `argparse.HelpFormatter`, let's
use `ArgumentDefaultsHelpFormatter`, so that the default values
of options are displayed in the help messages.
this should help developer understand the behavior of the script
better.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20262
to achieve feature parity with our existing building system, we
need to implement a new build target "dist-check" in the CMake-based
building system.
in this change, "dist-check" is added to CMake-based building system.
unlike the rules generated by `configure.py`, the `dist-check` target
in CMake depends on the dist-*-rpm targets. the goal is to enable user
to test `dist-check` without explicitly building the artifacts being
tested.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20266
in 57def6f1, we specified "package-mode" for poetry, but this option
was introduced in poetry 1.8.0, as the "non-package" mode support.
see https://github.com/python-poetry/poetry/releases/tag/1.8.0
this change practically bumps up the minimum required poetry version
to 1.8.0, we did update `pyproject.tombl` to reflect this change.
but wefailed to update the `Makefile`.
in this change, we update `Makefile` to ensure that user which happens
have an older version of poetry can install the version which supports
this version when running `make setupenv`.
Refs scylladb/scylladb#20284
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20286
we switched from `circular_buffer` to `chunked_fifo` to present
`io_sink::_pending_io` in the latest seastar now. to be prepared for
this change, let's
* add `chunked_fifo` class in `scylla-gdb.py`.
* use `circular_buffer` as a fallback of `chunked_fifo`. instead of
doing this the other way around, we try to send the message that
the latest seastar uses `chunked_fifo`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20280
Add parameter --cluster-pool-size that can control pool size for all
PythonTestSuite tests. By default, the pool size set to 10 for most of
the suites, but this is too much for laptops. So this parameter can be
used to lower the pool size and not to freeze the system. Additionally,
the environment variable CLUSTER_POOL_SIZE was added for a convenient way
to limit pool size in the system without the need to provide each time an
additional parameter.
Related: https://github.com/scylladb/scylladb/pull/20276Closesscylladb/scylladb#20289
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.
Fixes#19757Closesscylladb/scylladb#19758
* github.com:scylladb/scylladb:
abstract_replication_strategy: make get_ranges async
database: get_keyspace_local_ranges: get vnode_effective_replication_map_ptr param
compaction: task_manager_module: open code maybe_get_keyspace_local_ranges
alternator: ttl: token_ranges_owned_by_this_shard: let caller make the ranges_holder
alternator: ttl: can pass const gms::gossiper& to ranges_holder
alternator: ttl: ranges_holder_primary: unconstify _token_ranges member
alternator: ttl: refactor token_ranges_owned_by_this_shard
Removed people that no longer contribute to the scylladb.git and added/substituted reviewers responsible for maintaining the frontend components.
No need to backport, this is just an information for the github tool.
Closesscylladb/scylladb#20136
* github.com:scylladb/scylladb:
codeowners: add appropriate reviewers to the cluster components
codeowners: add appropriate reviewers to the frontend components
codeowners: fix codeowner names
codeowners: remove non contributors
table_helper has some quite awkward code, improve it a little.
Code cleanup, so no reason to backport.
Closesscylladb/scylladb#20194
* github.com:scylladb/scylladb:
table_helper: insert(): improve indentation
table_helper: coroutinize insert()
table_helper: coroutinize cache_table_info()
table_helper: extract try_prepare()
Currently "table removal" is logged as a reason of compaction stop for table drop,
tablet cleanup and tablet split. Modify log to reflect the reason.
Closesscylladb/scylladb#20042
* github.com:scylladb/scylladb:
test: add test to check compaction stop log
compaction: fix compaction group stop reason
we need to test the installation of cqlsh rpm. also, we should use the
correct paths of the generated rpm packages.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
cmake uses a different arrangement, so let's check for the existence
of the build directory and fallback to cmake's build directory.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the centos image repos on docker has been deprecated, and the
repo for centos7 has been removed from the main CentOS servers.
so we are either not able to install packages from its default repo,
without using the vault mirror, or no longer to pull its image
from dockerhub.
so, in this change
* we switch over to rockylinux:9, which is the latest stable release
of rockylinux, and rockylinux is a popular clone of RHEL, so it
matches our expectation of a typical use case of scylla.
* use dnf to manage the packages. as dnf is the standard way to manage
rpm packages in modern RPM-based distributions.
* do not install deltarpm.
delta rpms are was not supported since RHEL8, and the `deltarpm` package
is not longer available ever since. see
https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html-single/considerations_in_adopting_rhel_8/index#ref_the-deltarpm-functionality-is-no-longer-supported_notable-changes-to-the-yum-stack
as a sequence, this package does not exist in Rockylinux-9.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
when building with the rules generated from `configure.py`,
these files are located under tools' own build directory.
so correct them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
if SELinux is enabled on the host, we'd have following failure when
running `dist-check.sh`:
```
+ podman run -i --rm -v /home/kefu/dev/scylladb:/home/kefu/dev/scylladb docker.io/centos:7 /bin/bash -c 'cd /home/kefu/dev/scylladb && /home/kefu/dev/scylladb/tools/testing/dist-check/docker.io/centos-7.sh --mode debug'
/bin/bash: line 0: cd: /home/kefu/dev/scylladb: Permission denied
```
to address the permission issue, we need to instruct podman to
relabel the shared volume, so that the container can access
the shared volume.
see also https://docs.podman.io/en/stable/markdown/podman-pod-create.1.html#volume-v-source-volume-host-dir-container-dir-options
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, none of the target generated by CMake-based
building system runs `test.py`. but `build.ninja` generated directly
by `configure.py` provides a target named `test`, which runs the
`test.py` with the options passed to `configure.py`.
to be more compatible with the rules generated by `configure.py`,
in this change
* do not include "CTest" module, as we are not using CTest for
driving tests. we use the homebrew `test.py` for this purpose.
more importantly, the target named "test" is provided by "CTest".
so in order to add our own "test" target, we cannot use "CTest"
module.
* add a target named "test" to run "test.py".
* add two CMake options so we can customize the behavior of "test.py",
this is to be compatible with the existing behavior of `configure.py`.
Refs scylladb/scylladb#2717
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20263
This adds minimal implementation of the start-backup API call.
The method starts a task that uploads all files from the given keyspace's snapshot to the requested endpoint/bucket. Arguments are:
- endpoint -- the ID in object_store.yaml config file
- bucket -- the target bucket to put objects into
- keyspace -- the keyspace to work on
- snapshot -- the method assumes that the snapshot had been already taken and only copies sstables from it
The task runs in the background, its task_id is returned from the method once it's spawned and it should be used via /task_manager API to track the task execution and completion (hint: it's good to have non-zero TTL value to make sure fast backups don't finish before the caller manages to call wait_task API).
Sstables components are scanned for all tables in the keyspace and are uploaded into the /bucket/${cf_name}/${snapshot_name}/ path.
refs: #18391Closesscylladb/scylladb#19890
* github.com:scylladb/scylladb:
tools/scylla-nodetool: add backup integration
docs: Document the new backup method
test/object_store: Test that backup task is abortable
test/object_store: Add simple backup test
test/object_store: Move format_tuples()
test/pylib: Add more methods to rest client
backup-task: Make it abortable (almost)
code: Introduce backup API method
database: Export parse_table_directory_name() helper
database: Introduce format_table_directory_name() helper
snapshot-ctl: Add config to snapshot_ctl
snapshot-ctl: Add sstables::storage_manager dependency
snapshot-ctl: Maintain task manager module
snapshot-ctl: Add "snapshots" logger
snapshot-ctl: Outline stop() method and constructor
snapshot-ctl: Inline run_snapshot_list<>
test/cql_test_env: Export task manager from cql test env
task_manager: Print task ttl on start (for debugging)
docs: Update object_storage.md with AWS_ environment
docs: Restructure object_storage.md
since the rules generated by `configure.py` has this target, we need
to have an equivalent target as well in CMake-based buidling system.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20265
Increase pool size changes were recently reverted because of the flakiness for the test_gossip_boot test. Test started
to fail on adding the node to the cluster without any issues in the Scylla log file. In test logs it looked like the
installation process for the new node just hanged. After investigating the problem, I've found out that the issue is that
test.py was draining the io_executor pool for cleaning the directory during install that was set to eight workers. So
to fix the issue, io_executor pool should be increased to more or less the same ratio as it was: doubled cluster pool size.
Closesscylladb/scylladb#20276
so that we can use {fmt} with it without the help of fmt::streamed.
also since we have a proper formatter for replication_strategy_type,
let's implement
`formatter<vnode_effective_replication_map::factory_key>`
as well.
since there are no callers of these two operator<<, let's drop
them in this change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20248
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.
Fixes#19757
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for making the function async.
Then, it will need to hold on to the erm while getting
the token_ranges asynchronously.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is used only here and can be simplified by
checking if the keyspace replication strategy
is per table by the caller.
Prepare for making get_keyspace_local_ranges async.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add static `make` methods to ranges_holder_{primary,secondary}
and use them to make the ranges objects and pass them
to `token_ranges_owned_by_this_shard`, rather than letting
token_ranges_owned_by_this_shard invoke the right constructor
of the ranges_holder class.
Prepare for making `make` async.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than holding a variant member (and defining
both ranges_holder_{primary,secondary} in both
specilizations of the class, just make the internal
ranges_holder class first-class citizens
and parameterize the `token_ranges_owned_by_this_shard`
template by this class type.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
table_helper::cache_table_info() is fairly convoluted. It cannot be
easily coroutinized since it invokes asynchronous functions in a
catch block, which isn't supported in coroutines. To start to break it
down, extract a block try_prepare() from code that is called twice. It's
both a simplification and a first step towards coroutinization.
The new try_prepare() can return three values: `true` if it succeeded,
`false` if it failed and there's the possibility of attempting a fallback,
and an exception on error.
The `keyspace_compaction` method incorrectly appends the column family
parameter to the URL using a regular string, `"?cf={table}"`, instead of
an f-string, `f"?cf={table}"`. As a result, the column family name is
sent as `{table}` to the server, causing the compaction request to fail.
Fix this issue by passing the parameter to the POST request using a
dictionary instead of appending it to the URL.
Fixes#20264
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#20243
before this change, we use the default options when creating `test_env`,
and the default options enable `use_uuid`. but the modes of
`perf-sstables` involving reads assumes that the identifiers are
deterministic. so that the previously written sstables using the "write"
mode can be read with the modes like "index_read", which just uses
`test_env::make_sstable()` in `load_sstables()`, and under the hood,
`test_env::make_sstable()` uses `test_env::new_generation()` for
retrieving the next identifier of sstable. when using integer-base
identifier, this works. as the sstable identifiers are generated
from a monotonically increasing integer sequence, where the identifiers
are deterministic. but this does not apply anymore when the UUID-based
identifiers are used, as the identifiers are generated with a
pseudorandom generator of UUID v1.
in this change, to avoid relying on the determinism of the integer-based
sstable identifier generation, we enumerate sstables by listing the
given directory, and parse the path for their identifier.
after this change, we are able to support the UUID-based sstable
identifier.
another option is disable the UUID-based sstable identifier when
loading sstables. the upside is that this approach is minimal and
straightforward. but the downside is that it encodes the assumption
in the algorithm implicitly, and could be confusing -- we create
a new generation for loading an existing sstable with this generation.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20183
Now the endpoint hanler gets the value from db::config which is not nice
from several perspectives. First, it gets config (ab)using database.
Second, it's compaction manager that "knows" its throughput, global
config is the initial source of that information.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20173
Currently the major compaction task impl grabs this (non-updateable)
value from db::config. That's not good, all services including
compaction manager have their own configs from which they take options.
Said that, this patch puts the said option onto
compaction_manager::config, makes use of it and configures one from
db::config on start (and tests).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20174
This is prerequisite for "restore from object storage" feature. In order to collect the sstables in bucket one would need to list the bucket contents with the given prefix. The ListObjectsV2 provides a way for it and here's the respective s3::client extension.
Closesscylladb/scylladb#20120
* github.com:scylladb/scylladb:
test: Add test for s3::client::bucket_lister
s3_client: Add bucket lister
s3_client: Encode query parameter value for query-string
in 947e2814, we pass `--tty` as long as we are using podman _or_
we are in interactive mode. but if we build the tree using podman
using jenkins, we are seeing that ninja is displaying the output
as if it's in an interactive mode. and the output includes ASCII
escape codes. this is distracting.
the reason is that we
* are using podman, and
* ninja tells if it should displaying with a "smart" terminal by
checking istty() and the "TERM" environmental variable.
so, in this change, we add --tty only if
* we are in the interactive mode.
* or stdin is associated with a terminal. this is the use case
where user uses dbuild to interactively build scylla
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20196
in 30e82a81, we add a contraint to the template parameter of
boost_test_print_type() to prevent it from being matched with
types which can be formatted with operator<<. but it failed to
work. we still have test failure reports like:
```
[Exception] - critical check ['s', 's', 't', '_', 'm', 'r', '.', 'i', 's', '_', 'e', 'n', 'd', '_', 'o', 'f', '_', 's', 't', 'r', 'e', 'a', 'm', '(', ')'] has failed
```
this is not what we expect. the reason is that we passed the template
parameters to the `has_left_shift` trait in the wrong order, see
https://live.boost.org/doc/libs/1_83_0/libs/type_traits/doc/html/boost_typetraits/reference/has_left_shift.html.
we should have passed the lhs of operator<< expression as first
parameter, and rhs the second.
so, in this change, we correct the type constraint by passing the
template parameter in the right order, now the error message looks
better, like:
```
test/boost/mutation_query_test.cc(110): error: in "test_partition_query_is_full": check !partition_slice_builder(*s) .with_range({}) .build() .is_full() has failed
```
it turns out boost::transformed_range<> is formattable with operator<<,
as it fulfills the constraints of `boost::has_left_shift<ostream, R>`,
but when printing it, the compiler fails when it tries to insert the
elements in the range to the output stream.
so, in order to workaround this issue, we add a specialization for
`boost::transformed_range<F, R`.
also, to improve the readability, we reimplement the `has_left_shift<>`
as a concept, so that it's obvious that we need to put both the output
stream as the first parameter.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20233
This patch adds tests for summary calculation. It adds two tests, the
first is a basic calculation for P50, P95, P99 by adding 100 elements
into 20 buckets.
The second test look that if elements are found in the infinite bucket,
the result would be the lower limit (33s) and not infinite.
Relates to #20255
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch handles an edge cases related to The infinite bucket
limit.
Summaries are the P50, P95, and P99 quantiles.
The quantiles are calculated from a histogram; we find the bucket and
return its upper limit.
In classic histograms, there is a notion of the infinite bucket;
anything that does not fall into the last bucket is considered to be
infinite;
with quantile, it does not make sense. So instead of reporting infinite
we'll report the bucket lower limit.
Fixes#20255
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
as we have an API for backup a keyspace, let's expose this feature
with nodetool. so we can exercise it without the help of scylla-manager
or 3rd-party tools with a user-friendly interface.
in this change:
* add a new subcommand named "backup" to nodetool
* add test to verify its interaction with the API server
* add two more route to the REST API mock server, as
the test is using /task_manager/wait_task/{task_id} API.
for the sake of completeness, the route for
/task_manager/{part1} is added as well.
* update the document accordingly.
* the bash completion script is updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
It starts similarly to simpl backup test, but injects a pause into the
task once a single file is scheduled for upload, then aborts the task,
waits for it to fail, and check that _not_ all files are uploaded.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test shows how to backup a keyspace:
- flush
- take snapshot
- start backup with the new API method
- wait for the task to finish
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Namely:
- POST /storage_service/snapshots to take snapshot on a ks
- GET /task_manager/get_task_status/{id} to get status of a running task
- GET /task_manager/wait_task/{id} to wait for a task to finish
- POST /task_manager/abort_task/{id} to abort a running task
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make the impl::is_abortable() return 'yes' and check the impl::_as in
the files listing loop. It's not real abort, since files listing loop is
expected to be fast and most of the time will be spent in s3::client
code reading data from disk and sending them to S3, but client doesn't
support aborting its requests. That's some work yet to be done.
Also add injection for future testing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method starts a task that uploads all files from the given
keyspace's snapshot to the requested endpoint/bucket. The task runs in
the background, its task_id is returned from the method once it's
spawned and it should be used via /task_manager API to track the task
execution and completion (hint: it's good to have non-zero TTL value to
make sure fast backups don't finish before the caller manages to call
wait_task API).
If snapshot doesn't exist, nothing happens (FIXME, need to return back
an error in that case).
If endpoint is not configured locally, the API call resolves with
bad-request instantly.
Sstables components are scanned for all tables in the keyspace and are
uploaded into the /bucket/${cf_name}/${snapshot_name}/ path.
Task is not abortable (FIXME -- to be added) and doesn't really report
its progress other than running/done state (FIXME -- to be added too).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's parse_table_directory_name() static helper in database.cc code
that is used by methods that parse table tree layout for snapshot.
Export this helper for external usage and rename to fit the format_...
one introduced by previous patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The one makes table directory (not full path) out of table name and
uuid. This is to be symmetrical with yet another helper that converts
dirctory name back to table name and uuid (next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Pretty much all services in Scylla have their own config. Add one to
snapshot-ctl too, it will be populated later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage_manager maintains set of clients to configured object
storage(s). The snapshot ctl is going to spawn tasks that will talk to
those storages, thus it needs the storage manager to get the clients
from.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This service is going to start tasks managed by task manager. For that,
it should have its module set up and registered.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper will be used by a code from another .cc file, so the
template needs to be in header for smooth instantiation
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the doc assumes that object storage can only be used to keep
sstables on it. It's going to change, restructure the doc to allow for
more usage scenarios.
Currently it doesn't, one of the node crashes with std::out_of_range
exception and meaningless calltrace
[Botond]: this test checks the case of reading a partition via
MUTATION_FRAGMENTS from a node which doesn't own said partition.
refs: #18786
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
By making it a required argument, making sure the topology version is
pinned for the duration of the query. This is needed because mutation
dump queries bypass the storage proxy, where this pinning usually takes
place. So it has to be enforced here.
When using tablets, the replica-side doesn't handle un-owned tokens.
table::shard_for_reads() will just return 0 for un-owned tokens, and a
later attempt at calling table::storage_group_for_token() with said
un-owned token will cause a crash (std::terminate due to
std::out_of_range thrown in noexcept context).
The replicas rely on the coordinator to not send stray requests, but for
select from mutation_fragments(table) queries, there is no coordinator
side who could do the correct dispatching. So do this in
mutation_dump(), just creating empty readers for un-owned tokens.
Since a native reversed format is used for reversed queries,
additional tests with restrictions on clustering keys are required
to capture possible errors like https://github.com/scylladb/scylladb/issues/20191
earlier than in dtests.
Add parametrization to the following tests:
+ test_query_reverse
+ test_query_reverse_paging
to accept a comparison operator used in selection criteria for a Query
operation.
compaction_manager::remove passes "table removal" as a reason
of stopping ongoing compactions, but currently remove method
is also called when a tablet is migrated or split.
Pass the actual reason of compaction stop, so that logs aren't
misleading.
This reverts commit cc428e8a36. It causes
may spurious CI failures while nodes are being torn down. Revert it until
the root cause is fixed, after which it can be reinstated.
Fixes#20116.
Now, when each shard storage_group_manager keeps
only the storage_groups for the tablet replica it owns,
we can simple return the storage_group map size
instead of counting the number of tablet replicas
mapped to this shard.
Add a unit test that sums the tablet count
on all shards and tests that the sum is equal
to the configured default `initial_tablets.
Fixes#18909
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#20223
…utations vector
With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls when freed.
For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
(inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
(inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```
This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time, and by that, reducing memory footprint and preventing reactor stalls.
Fixes#18173Closesscylladb/scylladb#18174
* github.com:scylladb/scylladb:
schema_tables: calculate_schema_digest: filter the key earlier
schema_tables: calculate_schema_digest: prevent stalls due to large mutations vector
Additional log prints information on the read query being executed.
It lists information like whether the query is a reversed one or
not, and table_schema and query_schema versions.
When executing reversed queries, a native revered format shall be used.
Therefore the table schema and the clustering key bounds are reversed
before a partition slice and a read command are constructed.
Similarly as for cql3::statements::select_statement.
In order to increase readability, a schema variable is renamed to
a table_schema to emphesize a table schema is passed to the function
and used across it.
Allows us to introduce a query_schema variable in the next patch.
Currently, database::tables_metadata::add_table needs to hold a write
lock before adding a table. So, if we update other classes keeping
track of tables before calling add_table, and the method yields,
table's metadata will be inconsistent.
Set all table-related info in tables_metadata::add_table_helper (called
by add_table) so that the operation is atomic.
Analogically for remove_table.
Fixes: #19833.
Closesscylladb/scylladb#20064
io_fiber/store_snapshot_descriptor now gets the actual number of items
preserved when the log is truncated, fixing extra entries remained after
log snapshot creation. Also removes incorrect check for the number of
truncated items in the
raft_sys_table_storage::store_snapshot_descriptor.
Minor change: Added error_injection test API for changing snapshot thresholds settings.
Fixesscylladb/scylladb#16817Fixesscylladb/scylladb#20080Closesscylladb/scylladb#20095
* github.com:scylladb/scylladb:
raft: Ensure const correctness in applier_fiber.
raft: Invoke store_snapshot_descriptor with actually preserved items.
raft: Use raft_server_set_snapshot_thresholds in tests.
raft: Fix indentation in server.cc
raft: Add a test to check log size after truncation.
raft: Add raft_server_set_snapshot_thresholds injection.
utils: Ensure const correctness of injection_handler::get().
It is unsafe to restrict the sync nodes for repair to the source data center if it has too low replication factor in network_topology_replication_strategy, or if other nodes in that DC are ignored.
Also, this change restricts the usage of source_dc to `network_topology` and `everywhere_topology`
strategies, as with simple replication strategy
there is no guarantee that there would be any
more replicas in that data center.
Fixes#16826
Reproducer submitted as https://github.com/scylladb/scylla-dtest/pull/3865
It fails without this fix and passes with it.
* Requires backport to live versions. Issue hit in the filed with 2022.2.14
Closesscylladb/scylladb#16827
* github.com:scylladb/scylladb:
repair: do_rebuild_replace_with_repair: use source_dc only when safe
repair: replace_with_repair: pass the replace_node downstream
repair: replace_with_repair: pass ignore_nodes as a set of host_id:s
repair: replace_rebuild_with_repair: pass ks_erms from caller
nodetool: rebuild: add force option
Add and use utils::optional_param to pass source_dc
- raft_sys_table_storage::store_snapshot_descriptor now receives a number of
preserved items in the log, rather than _config.snapshot_trailing value;
- Incorrect check for truncated number of items in store_snapshot_descriptor
was removed.
Fixesscylladb/scylladb#16817Fixesscylladb/scylladb#20080
Replace raft_server_snapshot_reduce_threshold with raft_server_set_snapshot_thresholds in tests
as raft_server_set_snapshot_thresholds fully covers the functionality of raft_server_snapshot_reduce_threshold.
before this change, `scylla sstable shard-of` didn't support tablets,
because:
- with tablets enabled, data distribution uses the scheduler
- this replaces the previous method of mapping based on vnodes and shard numbers
- as a result, we can no longer deduce sstable mapping from token ranges
in this change, we:
- read `system.tablets` table to retrieve tablet information
- print the tablet's replica set (list of <host, shard> pairs)
- this helps users determine where a given sstable is hosted
This approach provides the closest equivalent functionality of
`shard-of` in the tablet era.
Fixesscylladb/scylladb#16488
---
no need to backport, it's an improvement, not a critical fix.
Closesscylladb/scylladb#20002
* github.com:scylladb/scylladb:
tools: enhance `scylla sstable shard-of` to support tablets
replica/tablets: extract tablet_replica_set_from_cell()
tools: extract get_table_directory() out
tools: extract read_mutation out
build: split the list of source file across multiple line
tools/scylla-sstable: print warning when running shard-of with tablets
The include flag directive now treats missing content as info logs instead of warnings. This prevents build failures when the enterprise-specific content isn't yet available.
If the enterprise content is undefined, the directive automatically loads the open-source content. This ensures the end user has access to some content.
address comments
Closesscylladb/scylladb#19804
in 5ce07e5d84, the target named "storage_proxy.o" was added for
training the build of clang. but the rule for building this target
has two flaws:
* it was added a dependency of the "all" target, but we don't need
to build `storage_proxy.cc` twice when building the tree in the
regular build job. we only need to build it when creating the
profile for training the build of clang.
* it misses the include directory of abseil library. that's why we
have following build failure when building the default target:
```
[2024-08-18T14:58:37.494Z] /usr/local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/jenkins/workspace/scylla-master/scylla-ci/scylla -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/seastar/gen/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/seastar/gen/src -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/gen -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT service/CMakeFiles/storage_proxy.o.dir/Debug/storage_proxy.cc.o -MF service/CMakeFiles/storage_proxy.o.dir/Debug/storage_proxy.cc.o.d -o service/CMakeFiles/storage_proxy.o.dir/Debug/storage_proxy.cc.o -c /jenkins/workspace/scylla-master/scylla-ci/scylla/service/storage_proxy.cc
[2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/service/storage_proxy.cc:17:
[2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/db/commitlog/commitlog.hh:19:
[2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/db/commitlog/commitlog_entry.hh:15:
[2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/mutation/frozen_mutation.hh:15:
[2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/mutation/mutation_partition_view.hh:16:
[2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/build/gen/idl/mutation.dist.impl.hh:14:
[2024-08-18T14:58:37.495Z] /jenkins/workspace/scylla-master/scylla-ci/scylla/serializer_impl.hh:20:10: fatal error: 'absl/container/btree_set.h' file not found
[2024-08-18T14:58:37.495Z] 20 | #include <absl/container/btree_set.h>
[2024-08-18T14:58:37.495Z] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2024-08-18T14:58:37.495Z] 1 error generated.
```
* if user only enables "dev" mode, we'd have:
```
CMake Error at service/CMakeLists.txt:54 (add_library):
No SOURCES given to target: storage_proxy.o
```
so, in this change, we
* exclude this target from "all"
* link this target against abseil header library, so it has access
to the abseil library. please note, we don't need to build an
executable in this case, so the header would suffice.
* add a proxy target to conditionally enable/disable this target.
as CMake does not support generator expression in `add_dependencies()`
yet at the time of writing.
see https://gitlab.kitware.com/cmake/cmake/-/issues/19467
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20195
~~~
utils/tagged_integer: remove conversion to underlying integer
Silently converting a tagged (i.e., "dimension-ful") integer to a naked
("dimensionless") integer defeats the purpose of having tagged integers,
and is a source of practical bugs, such as
<https://github.com/scylladb/scylladb/issues/20080>.
We could make the conversion operator explicit, for enforcing
static_cast<TAGGED_INTEGER_TYPE::value_type>(TAGGED_INTEGER_VALUE)
in every conversion location -- but that's a mouthful to write. Instead,
remove the conversion operator, and let clients call the (identically
behaving) value() member function.
~~~
No backport needed (refactoring).
The series is supposed to solve #20081.
Two patches in the series touch up code that is known to be (orthogonally) buggy; see
- `service/raft_sys_table_storage: tweak dead code` (#20080)
- `test/raft/replication: untag index_t in test_case::get_first_val()` (#20151)
Fixes for those (independent) issues will have to be rebased on this series, or this series will have to be rebased on those (due to context conflicts).
The series builds at every stage. The debug and release unit test suites pass at the end.
Closesscylladb/scylladb#20159
* github.com:scylladb/scylladb:
utils/tagged_integer: remove conversion to underlying integer
test/raft/randomized_nemesis_test: clean up remaining index_t usage
test/raft/randomized_nemesis_test: clean up index_t usage in store_snapshot()
test/raft/replication: clean up remaining index_t usage
test/raft/replication: take an "index_t start_idx" in create_log()
test/raft/replication: untag index_t in test_case::get_first_val()
test/raft/etcd_test: tag index_t and term_t for comparisons and subtractions
test/raft/fsm_test: tag index_t and term_t for comparisons and subtractions
test/raft/helpers: tighten compare_log_entries() param types
service/raft_sys_table_storage: tweak dead code
service/raft_sys_table_storage: simplify (snap.idx - preserve_log_entries)
service/raft_sys_table_storage: untag index_t and term_t for queries
raft/server: clean up index_t usage
raft/tracker: don't drop out of index_t space for subtraction
raft/fsm: clean up index_t and term_t usage
raft/log: clean up index_t usage
db/system_keyspace: promise a tagged integer from increment_and_get_generation()
gms/gossiper: return "strong_ordering" from compare_endpoint_startup()
gms/gossiper: get "int32_t" value of "gms::version_type" explicitly
It is unsafe to restrict the sync nodes for repair to
the source data center if we cannot guarantee a quorum
in the data center with network-topology replication strategy.
This change restricts the usage of source_dc in the following cases:
1. For SimpleStrategy - source_dc is ignored since there is no guarantee
that it contains remaining replicas for all tokens.
2. For EverywhereStrategy - use source_dc if there are remaining
live nodes in the datacenter.
3. For NetworkTopologyStrategy:
a. It is considered unsafe to use source_dc if number of nodes
lost in that DC (replaced/rebuilt node + additional ignored nodes)
is greater than 1, or it has 1 lost node and rf <= 1 in the DC.
b. If the source_dc arg is forced, as with the new
`nodetool rebuild --force <source_dc>` option,
we use it anyway, even if it's considered to be unsafe.
A warning is printed in this case.
c. If the source_dc arg is user-provided, (using nodetool rebuild),
an error exception is thrown, advising to use an alternative dc,
if available, omit source_dc to sync with all nodes, or use the
--force option to use the given source_dc anyhow.
d. Otherwise, we look for an alternative source datacenter,
that has not lost any node. If such datacenter is found
we use it as source_dc for the keyspace, and log a warning.
e. If no alternative dc is found (and source_dc is implicit), then:
log a warning and fall back to using replicas from all nodes in the cluster.
Fixes#16826
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The callers already pass ignore_nodes as host_id:s
and we translate them into inet_address only for repair
so delay the translation as much as posible,
Refs scylladb/scylladb#6403
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The keyspaces replication maps must be in sync with the
token_metadata_ptr passed already to the functions,
so instead of getting it in the callee, let the caller
get the ks_erms along with retrieving the tmptr.
Note that it's already done on the rebuild path
for streaming based rebuild.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used to force usage of source_dc, even
when it is unsafe for rebuild.
Update docs and add test/nodetool/test_rebuild.py
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Clearly indicate if a source_dc is provided,
and if so, was it explicitly given by the user,
or was implicitly selected by scylla.
This will become useful in the next patches
that will use that to either reject the operation
if it's unsafe to use the source_dc and the dc was
explicitly given by the user, or whether
to fallback to using all nodes otherwise.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This commit extracts the information about the default for tables in keyspace creation
to a separate file in the _common folder. The file is then included using
the scylladb_include_flag directive.
The purpose of this commit is to make it possible to include a different file
in the scylla-enterprise repo - with a different default.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585Closesscylladb/scylladb#20181
would be more helpful, if the output of "--help" command line can
include the default value of options.
so, in this change, we include the default values in it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20170
before this change, we check for the existence of "TestSuite" node
under the root of XML tree, and then enumerating all "TestSuite" nodes
under this "TestSuite", this approach works. but it
* introduces unnecessary indent
* is not very readable
in this change, we just use "./TestSuite/TestSuite" for enumerating
all "TestSuite" nodes under "TestSuite". simpler this way.
---
it's a cleanup in the test driver script, hence no need to backport.
Closesscylladb/scylladb#20169
* github.com:scylladb/scylladb:
test.py: fix the indent
test.py: use XPath for iterating in "TestSuite/TestSuite"
Alternator already supports **authentication** - the ability to to sign each request as a particular user. The users that can be used are the different "roles" that are created by CQL "CREATE ROLE" commands. This series adds support for **authorization**, i.e., the ability to determine that only some of these roles are allowed to read or write particular tables, to create new tables, and so on.
The way we chose to do this in this series is to support CQL's existing role-based access control (RBAC) commands - GRANT and REVOKE - on Alternator tables. For example, an Alternator table "xyz" is visible to CQL as "alternator_xyz.xyz", so a `GRANT SELECT ON alternator_xyz.xyz TO myrole` will allow read commands (e.g., GetItem) on that table, and without this GRANT, a GetItem will fail with `AccessDeniedException`.
This series adds the necessary checks to all relevant Alternator operations, and also adds extensive functional testing for this feature - i.e., that certain DynamoDB API operations are not allowed without the appropriate GRANTs.
The following permissions are needed for the following Alternator API operations:
* **SELECT**: `GetItem`, `Query`, `Scan`, `BatchGetItem`, `GetRecords`
* **MODIFY**: `PutItem`, `DeleteItem`, `UpdateItem`, `BatchWriteItem`
* **CREATE**: `CreateTable`
* **DROP**: `DeleteTable`
* **ALTER**: `UpdateTable`, `TagResource`, `UntagResource`, `UpdateTimeToLive`
* _none needed_: `ListTables`, `DescribeTable`, `DescribeEndpoints`, `ListTagsOfResource`, `DescribeTimeToLive`, `DescribeContinuousBackups`, `ListStreams`, `DescribeStream`, `GetShardIterator`
Currently, I decided that for consistency each operation requires one permission only. For example, PutItem only requires MODIFY permission. This is despite the fact that in some cases (namely, `ReturnValues=ALL_OLD`) it can also _read_ the item. We should perhaps discuss this decision - and compare how it was done in CQL - e.g., what happens in LWT writes that may return old values?
Different permissions can be granted for a base table, each of its views, and the CDC table (Alternator streams). This adds power - e.g., we can allow a role to read only a view but not the base table, or read the table but not its history. GRANTing permissions on views or CDC logs require knowing their names, which are somewhat ugly (e.g., the name of GSI "abc" in table "xyz" is `alternator_xyz.xyz:abc`). But usefully, the error message when permissions are denied contains the full name of the table that was lacking permissions and which permissions were lacking, so users can easily add them.
In addition to permissions checking, this series also correctly supports _auto-grant_ (except #19798): When a role has permissions to `CreateTable`, any table it creates will automatically be granted all permissions for this role, so this role will be able to use the new table and eventually delete it. `DeleteTable` does the opposite - it removes permissions from tables being deleted, so that if later a second user re-creates a table with the same name, the first user will not have permissions over the new table.
The already-existing configuration parameter `alternator_enforce_authorization` (off by default), which previously only enabled authentication, now also enables authorization. Users that upgrade to the new version and already had `alternator_enforce_authorization=true` should verify that the users they use to authenticate either have the appropriate permissions or the "superuser" flag. Roles used to authenticate must also have the "login" flag.
Please note that although the new RBAC support implements the access control feature we asked for in #5047, this implementation is _not compatible_ with DynamoDB. In DynamoDB, the access control is configured through IAM operations or through the new `PutResourcePolicy` - operation, not through CQL (obviously!). DynamoDB also offers finer access-control granularity than we support (Scylla's RBAC works on entire tables, DynamoDB allows setting permissions on key prefixes, on individual attributes, and more). Despite this non-compatibility, I believe this feature, as is, will already be useful to Alternator users.
Fixes#5047 (after closing that issue, a new clean issue should be opened about the DynamoDB-compatible APIs that we didn't do - just so we remember this wasn't done yet).
New feature, should not be backported.
Closesscylladb/scylladb#20135
* github.com:scylladb/scylladb:
tests: disable test_alternator_enforce_authorization_true
test, alternator: test for alternator_enforce_authorization config
test/pylib: allow setting driver_connect() options in servers_add()
test: fix test_localnodes_joining_nodes
alternator, RBAC: reproducer for missing CDC auto-grant
alternator: document the new RBAC support
alternator: add RBAC enforcement to GetRecords
test/alternator: additional tests for RBAC
test/alternator: reduce permissions-validity-in-ms
test/alternator: add test for BatchGetItem from multiple tables
alternator: test for operations that do not need any permissions
alternator: add RBAC enforcement to UpdateTimeToLive
alternator: add RBAC enforcement to TagResource and UntagResource
alternator: add RBAC enforcement to BatchGetItem
alternator: add RBAC enforcement to BatchWriteItem
alternator: add RBAC enforcement to UpdateTable
alternator: add RBAC enforcement to Query and Scan
alternator: add RBAC enforcement to CreateTable
alternator: add RBAC enforcement to DeleteTable
alternator: add RBAC enforcement to UpdateItem
alternator: add RBAC enforcement to DeleteItem
alternator: add RBAC enforcement to PutItem
alternator: add RBAC enforcement to GetItem
alternator: stop using an "internal" client_state
Consider the following:
```
T
0 split prepare starts
1 repair starts
2 split prepare finishes
3 repair adds unsplit sstables
4 repair ends
5 split executes
```
If repair produces sstable after split prepare phase, the replica will not split that sstable later, as prepare phase is considered completed already. That causes split execution to fail as replicas weren't really prepared. This also can be triggered with load-and-stream which shares the same write (consumer) path.
The approach to fix this is the same employed to prevent a race between split and migration. If migration happens during prepare phase, it can happen source misses the split request, but the tablet will still be split on the destination (if needed). Similarly, the repair writer becomes responsible for splitting the data if underlying table is in split mode. That's implemented in replica::table for correctness, so if node crashes, the new sstable missing split is still split before added to the set.
Fixes#19378.
Fixes#19416.
**Please replace this line with justification for the backport/\* labels added to this PR**
Closesscylladb/scylladb#19427
* github.com:scylladb/scylladb:
tablets: Fix race between repair and split
compaction: Allow "offline" sstable to be split
The test is flaky and needs to be fixed in order to not randomly break
our CI, OTOH can be commented out for the time being, so that we can
marge the feature.
This patch adds tests that demonstrates the current way that Alternator's
authentication and authorization are both enabled or disabled by the
option "alternator_enforce_authorization".
If in the future we decide to change this option or eliminate it (e.g.,
remain just with the "authenticator" and "authorizer" options), we can
easily update these tests to fit the new configuration parameters and
check they work as expected.
Because the new tests want to start Scylla instances with different
configuration parameters, they are written in the the "topology"
framework and not in the test/alternator framework. The test/alternator
framework still contains (test/alternator/test_cql_rbac.py) the vast
majority of the functional testing of the RBAC feature where all those
tests just assume that RBAC is enabled and needs to be tested.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The manager.driver_connect() functions allows to pass parameters when
creating the connection (e.g., a special auth_provider), but unfortunately
right now the servers_add() function always calls driver_connect()
without parameters. So in this patch we just add a new optional
parameter to servers_add(), driver_connect_opts, that will be passed to
driver_connect().
In theory instead of the new option to driver_connect() a caller can
pass start=False to servers_add() and later call driver_connect()
manually with the right arguments. The problem is that start=False
avoids more than just calling driver_connect(), so it doesn't solve
the problem.
An example of using the new option is to run Scylla with authentication
enabled, and then connect to it using the correct default account
("cassandra"/"cassandra"):
config = {
'authenticator': 'PasswordAuthenticator',
'authorizer': 'CassandraAuthorizer'
}
servers = await manager.servers_add(1, config=config,
driver_connect_opts={'auth_provider':
PlainTextAuthProvider(username='cassandra', password='cassandra')})
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The existing test
topology_experimental_raft/test_alternator::test_localnodes_joining_nodes
Tried to create a second server but *not* wait for it to complete, but
the trick it used (cancelling the task) doesn't work since commit 2ee063c
makes a list of unwaited tasks and waits for them anyway. The test
*appears* to work because it is the last test in the file, but if we
ever add another test in the same file (like I plan to do in the next
patch), that other test will find a "BROKEN" ScyllaClusterManager and
report that it failed :-(
Other tricks I tried to use (like killing the servers) also didn't work
because of various limitations and complications of the test framework
and all its layers.
So not wanting to fight the fragile testing framework any more at this
point, I just gave up and the test will *wait* for the second server
to come up. This adds 120 seconds (!) to the test, but since this whole
test file already takes more than 500 seconds to complete, let's bite
this bullet. Maybe in the future when the test framework improves, we can
avoid this 120 second wait.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a reproducing (xfailing) test for issue #19798, which
shows that if a role is able to create an Alternator table, the role is
able to read the new table (this is known as "auto-grant"), but is NOT
able to read the CDC log (i.e., use Alternator Streams' "GetRecords").
Once we do fix this auto-grant bug, it's also important to also implement
auto-revoke - the permissions on a deleted table must be deleted as well
(otherwise the old owner of a deleted table will be able to read a new
table with the same name). This patch also adds a test verifying that
auto-revoke works. This test currently passes (because there is no auto-
grant, so nothing needs to be revoked...) but if we'll implement auto-grant
and forget auto-revoke, the second test will start to fail - so I added
this test as a precaution against a bad fix.
Refs #19798
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In docs/alternator/compatibility.md we said that although Alternator
supports authentication, it doesn't support authorization (access
control). Now it does, so the relevant text needs to be corrected
to fit what we have today.
It's still in the compatibility.md document because it's not the same
API as DynamoDB's, so users with existing applications may need to be
aware of this difference.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "SELECT" permission on a table to
run a GetRecords on it (the DynamoDB Streams API, i.e., CDC).
The grant is checked on the *CDC log table* - not on the base table,
which allows giving a role the ability to read the base but not is
change stream, or vice versa.
The operations ListStreams, DescribeStreams, GetShardIterators do not
require any permissions to run - they do not read any data, and are
(in my opinion) similar in spirit to DescribeTable, so I think it's fine
not to require any permissions for them.
A test is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Additional tests for support for CQL Role-Based Access Control (RBAC)
in Alternator:
1. Check that even in an Alternator table whose name isn't valid as CQL
table names (e.g., uses the dot character) the GRANT/REVOKE commands
work as expected.
2. Check that superuser roles have full permissions, as expected.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We set in test/cql-pytest/run.py, affecting test/alternator/run, the
configuration permissions_validity_in_ms by default to 100ms. This means
that tests that need to check how GRANT or REVOKE work always need to
sleep for more than 100ms, which can make a test with a lot of these
operations very slow.
So let's just set this configuration value to 5ms. I checked that it
doesn't adversely affect the total running speed of test/alternator/run.
This change only affects running tests through test/alternator/run, which
is expected to be fast. I left the default for test.py as it was, 100ms,
the latency of individual tests is less important there.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
While working on the RBAC on BatchGetItem, I noticed that although
BatchGetItem may ask to read items from several tables, we don't have
a test covering this case! This patch fixes that testing oversight.
Note that for the write-side version of this operation, BatchWriteItem,
we do have tests that write to several tables in the same batch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Some operations, namely ListTables, DescribeTable, DescribeEndpoints,
ListTagsOfResource, DescribeTimeToLive and DescribeContinuousBackups
do not need any permissions to be GRANTed to a role.
Our rationale for this decision is that in CQL, "describe table" and
friends also do not require any permissions.
This patch includes a test that verifies that they really don't need
permissions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "ALTER" permission on a table to
run a UpdateTimeToLive on it. UpdateTimeToLive is similar in purpose to
UpdateTable, so it makes sense to use the same permission "ALTER" as we
do for UpdateTable.
A tests is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "ALTER" permission on a table to
run the TagResource or UntagResource operations on it. CQL does not
have an exact parallel of DynamoDB's tagging feature, but our usual
use of tags as an extension of UpdateTable to change non-standard options
(e.g., write isolation policy or tablets setup), so it makes sense to
require the same permissions we require for UpdateTable - namely "ALTER".
A test for both operations is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "SELECT" permission on a table to
run a BatchGetItem on it. A single batch may ask to write to several
different tables, so we fail the entire batch with AccessDeniedException
if any of the tables mentioned in the batch do not have SELECT permissions
for this role.
A tests is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "MODIFY" permission on a table to
run a BatchWriteItem on it. A single batch may ask to write to several
different tables, so we fail the entire batch with AccessDeniedException
if any of the tables mentioned in the batch do not have MODIFY permissions
for this role.
A tests is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "ALTER" permission on a table to
run a UpdateTable on it.
A tests is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "SELECT" permission on a table to
run a Query or Scan on it.
Both Query and Scan operations call the same do_query() function, so the
permission checks are put there.
Note that Query can read from either the base table or one of its views,
and the permissions on the base and each of the views can be separate
(so we can allow a role to only read one view, for example).
Tests for all of the above are also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "CREATE" permission on ALL
KEYSPACES to run a CreateTable operation.
The CreateTable operation also performs so-called "auto-grant": When a
role creates a table, it is automatically granted full permissions to
read, write, change or delete that new table.
A test for all these things is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "DROP" permission on a table to
run a DeleteTable on it.
Moreover, when a table and its views are deleted, any special permissions
previously GRANTed on this table are removed. This is necessary because
if a role creates a table it is automatically granted permissions on this
table (this is known as "auto-grant" - see the CreateTable patch for
details). If this role deletes this table and later a second role creates
a table with the same name, we don't want the first role to have
permissions on this new table.
Tests for permission enforcements and revocation on delete are also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "MODIFY" permission on a table to
run a UpdateItem on it.
Only the MODIFY permission is required, even if the operation may also
read the old value of the item, such as a read-modify-write operation
or even using ReturnValues='ALL_OLD'.
A test is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "MODIFY" permission on a table to
run a DeleteItem on it.
Only the MODIFY permission is required, even if the operation may also
read the old value of the item (using ReturnValues='ALL_OLD').
A test is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "MODIFY" permission on a table to
run a PutItem on it.
Only the MODIFY permission is required, even if the operation may also
read the old value of the item (using ReturnValues='ALL_OLD').
A test is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In this patch, we begin to add role-based access control (RBAC)
enforement to Alternator - in this patch only to GetItem.
After the preparation of client_state correctly in the previous patch,
the permission check itself in the get_item() function is very simple.
The bigger part of this patch is a full functional test in
test/alternator/test_cql_rbac.py. The test is quite self-explanatory
and heavily commented. Basically we check that a new role cannot
read with GetItem a pre-existing table, and we can add that ability
by GRANTing (in CQL) the new role the ability to SELECT the table,
the keyspace, all keyspaces, or add that ability to some other role
that this role inherits.
In the following patches, we will add role-based access control to
the Alternator operations, but the functional tests will be shorter -
we don't need to check the role inheritence, "all keyspaces" feature,
and so on, for every operation separately since they all use the
same underlying checking functions which handles these role inheritence
issues in exactly the same way.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Scylla uses a "client_state" object to encapsulate the information of
who the client is - its IP address, which user was authenticated, and so on.
For an unknown reason, Alternator created for each request an "internal"
client_state, meaning that supposedly the client for each request was
some sort of internal process (e.g., repair) rather than a real client.
This was wrong, and we even had a FIXME about not putting the client's
IP address in client_state.
So in this patch, we start using a normal "external" client_state
instead of an "internal" one. The client_state constructors are very
different in the two cases, so a few lines of code had to change.
I hope that this change will cause no functional changes. For example,
Alternator was already setting its own timeouts explicitly and not
relying on the default ones for external clients. However, we need to
fix this for the following patches which introduce permissions checks
(Role-Based Access Control - RBAC) - the client_state methods for
checking permissions become no-ops for *internal* clients (even if the
client_state contains an authenticated users). We need these functions
to do their job - so we need an *external* variant of client_state.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Commit 9f93dd9fa3 changed
`tablet_sstable_set::_sstable_sets` to be a `absl::flat_hash_map` and
in addition, `std::set<size_t> _sstable_set_ids` was
added. `_sstable_set_ids` is set up in the
`tablet_sstable_set(schema_ptr s, const storage_group_manager& sgm,
const locator::tablet_map& tmap)` constructor, but it is not copied in
`tablet_sstable_set(const tablet_sstable_set& o)`.
This affects the `tablet_sstable_set::tablet_sstable_set` method as it
depends on the copy constructor. Since sstable set can be cloned when
a new sstable set is added, the issue will cause ids not being copied
into the new sstable set. It's healed only after compaction, since the
sstable set is rebuilt from scratch there.
This PR fixes this issue by removing the existing copy constructor of
`tablet_sstable_set` to enable the implicit default copy constructor.
Fixes#19519Closesscylladb/scylladb#20115
* github.com:scylladb/scylladb:
boost/sstable_set_test: add testcase to test tablet_sstable_set copy constructor
replica: fix copy constructor of tablet_sstable_set
This patch adds metrics for batch get_item and batch write_item.
The new metrics record summary and histogram for latencies and batch size.
Batch sizes are implemented as ever-growing counters. To get the average batch size divide the rate of
the batch size counter by the rate of the number of batch counter:
```rate(batch_get_item_batch_size)/rate(batch_get_item)```
Relates to #17615
New code, No need to backport
Closesscylladb/scylladb#20190
* github.com:scylladb/scylladb:
Add tests for Alternator batch operation metrics
alternator/executor: support batch latency and size metrics
Add metrics for Alternator get and write batch operations
This patch adds unit tests to verify the correctness of the newly
introduced histogram metrics for get and write batch operation
latencies.
The test uses the existing latency test with the added metrics.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch Updated the get and write batch operations in Alternator to
record latency using the newly added histogram metrics.
It adds logic to increment the counters with the number of items
processed in each batch.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Introduced histogram metrics to track latency for Alternator's get and
write batch operations.
Added counters to record the number of items processed in each batch
operation.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Remove the existing copy constructor to enable the use of the implicit
copy constructor. This fixes the issue of `_sstable_set_ids` not being
copied in the current copy constructor.
Fixes#19519
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
this change is a follow up of 06c60f6ab, which updated the
2nd step of the test to use switch-case, but missed the 1st step.
so this change updates the first step of the test as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we check for the existence of "TestSuite" node
under the root of XML tree, and then enumerating all "TestSuite" nodes
under this "TestSuite", this approach works. but it
* introduces unnecessary indent
* is not very readable
in this change, we just use "./TestSuite/TestSuite" for enumerating
all "TestSuite" nodes under "TestSuite". simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The method cannot find the TestSuite in the XML file and fails the whole job, however tests are passed. The issue was in incorrect understanding of boost summarization method. It creates one file for all modes, so there is no need to go through all modes to convert the XML file for allure.
Closes: https://github.com/scylladb/scylladb/issues/20161Closesscylladb/scylladb#20165
Currently, each frozen mutation we get from
system_keyspace::query_mutations is unfrozen in whole
to a mutation and only then we check its key with
the provided `accept_keyspace` function.
This is wasteful, since they key can be processed
directly form the frozen mutation, before taking
the toll of unfreezing it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls
when freed.
For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
(inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
(inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```
This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time,
and by that, reducing memory footprint and preventing
reactor stalls.
Fixes#18173
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, `scylla sstable shard-of` didn't support tablets,
because:
- with tablets enabled, data distribution uses the scheduler
- this replaces the previous method of mapping based on vnodes and shard numbers
- as a result, we can no longer deduce sstable mapping from token ranges
in this change, we:
- read `system.tablets` table to retrieve tablet information
- print the tablet's replica set (list of <host, shard> pairs)
- this helps users determine where a given sstable is hosted
This approach provides the closest equivalent functionality of
`shard-of` in the tablet era.
Fixesscylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the `get_table_directory()` function will have applications
beyond its current use in `schema_loader.cc`. its ability to locate
the directory storing the sstables of given table could be valuable
in other subcommand(s) implementation.
so, in this change we extract it out into a dedicated source file,
so that it accept the primary_key and an optional clustering_key.
Refs scylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the `read_mutation_from_table_offline()` function will have applications
beyond its current use in `schema_loader.cc`. its ability to parser
mutation data from sstables could be valuable in other subcommand(s)
implementation.
so, in this change we extract it out into a dedicated source file,
so that it accept the primary_key and an optional clustering_key.
Refs scylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Split the extended list of source files across multiple lines.
This improves readability and makes future additions easier to
review in diffs.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the subcommand of "shard-of" does not support tablets yet. so let's
print out an error message, instead of printing the mapping assuming
that the sstables are distributed based on token only.
this commit also adds two more command line options to this subcommand,
so that user is required to specify either "--vnodes" or "--tablets"
to instruct the tool how the cluster distributes the tokens across nodes
and their shards. this helps to minimize the suprise of user.
this change prepares for the succeeding changes to implement the tablets
support.
the corresponding test is updated accordingly so that it only exercises
the "shard-of" subcommand without tablets. we will test it with tablets
enabled in a succeeding change.
Refs scylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Silently converting a tagged (i.e., "dimension-ful") integer to a naked
("dimensionless") integer defeats the purpose of having tagged integers,
and is a source of practical bugs, such as
<https://github.com/scylladb/scylladb/issues/20080>.
We could make the conversion operator explicit, for enforcing
static_cast<TAGGED_INTEGER_TYPE::value_type>(TAGGED_INTEGER_VALUE)
in every conversion location -- but that's a mouthful to write. Instead,
remove the conversion operator, and let clients call the (identically
behaving) value() member function.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "test/raft/randomized_nemesis_test.cc":
- addition of tagged and untagged (both should be tagged)
- taking the minimum of an index difference and a container size (both
should be untagged)
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
unpack and clean up the relatively complex
first_to_remain = max(snap.idx + 1 - preserve_log_entries, 0)
calculation in persistence::store_snapshot().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly untag the operands / arguments of the following operations, in
"test/raft/replication.hh":
- assignment to raft_cluster::_seen
- call to hasher_int::hash_range()
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
raft_cluster::get_states() passes a "start_idx" to create_log(), and
create_log() uses it as an "index_t" object. Match the type of "start_idx"
to its name.
This patch is best viewed with "git show -W".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In test_case::get_first_val(), the asssignment
first_val = initial_snapshots[initial_leader].snap.idx;
*both* relies on implicit conversion of the tagged integer type "index_t"
to the underlying "uint64_t", *and* is a logic bug, as reported at
<https://github.com/scylladb/scylladb/issues/20151>.
For now, wean the buggy asssignment off the disappearing
tagged-to-untaggged conversion.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Properly annotate index_t and term_t constants for use in
BOOST_CHECK_EQUAL() and BOOST_CHECK().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Properly annotate index_t and term_t constants for use in
BOOST_CHECK_EQUAL(), BOOST_CHECK(). Clean up the first args of
read_quorum() calls -- stay in term_t space.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The "from" and "to" parameters of compare_log_entries() are raft log
indices; change them to raft::index_t, and update the callers.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In raft_sys_table_storage::store_snapshot_descriptor(), the condition
preserve_log_entries > snap.idx
*both* relies on implicit conversion of the tagged integer type "index_t"
to the underlying "uint64_t", *and* is a logic bug, as reported at
<https://github.com/scylladb/scylladb/issues/20080>.
Ticket#20080 explains that this condition always evaluates to false in
practice, and that the "else" branch handles all cases correctly anyway.
For now, wean the buggy expression off the disappearing
tagged-to-untaggged conversion.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Currently, boost tests aren't using Junit. Enable Junit report output and clean them from skipped test, since boost tests are executed by function name rather than filename. This allows including boost tests result to the Allure report.
Related: https://github.com/scylladb/qa-tasks/issues/1665Closesscylladb/scylladb#19925
before this change, we assume user runs nodetool tests right under the root source directory. if user runs them under `test/nodetool`, the suppression rules are not applied. as the path is incorrect in that case.
after this change, the supression rules' path is deduced from the top src directory. so we can now run the nodetool test under `test/nodetool` .
---
no need to backport, this change improves developer's experience.
Closesscylladb/scylladb#20119
* github.com:scylladb/scylladb:
test/nodetool: deduce subpression path from top srcdir
test/nodetool: deduce path from top srcdir
Tenant names starting with `$` are reserved for internal ones.
Forbid creating new service level which name starts with `$`
and log a warning for existing service levels with `$` prefix.
Closesscylladb/scylladb#20122
`tools/toolchain/optimized_clang.sh` builds this target for creating
the profile in order to build clang optimized with this profile data.
so let's be compatible with `configure.py`, and add this target to
CMake building system as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20105
The lock_table() method needs database, ks and cf to find the table on
all shards. The same can be achieved with the help of global_table_ptr
thing that all the core callers already have at hand.
There's a test that doesn't have global table, but it can get one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20139
Typically the sstable_directory is constructed out of a table object.
Some code, namely tests and schema-loader, don't have table at hand and
construct directory out of schema, sharder, path-to-sstables, etc. This
code doesn't work with any storage options other than local ones, so
there's no need (yet) to carry this argument over.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20138
before this change, we look up for the mode using the command line
option as the key, but that's incorrect if the command line option does
not match with any of the known names. in that case, `test_mode` just
create another pair of <sstring, test_modes>, and return the second
component of this pair. and the second component is not what we expect.
we should have thrown an exception.
in this change
* the test_mode map is marked const.
* the overloads for parsing / formatting the `test_modes` type are
added, so that boost::program_options can parse and format it.
after this change, we print more user friendly error, like
```
/scylla perf-sstable --mode index-foo
error: the argument ('index-foo') for option '--mode' is invalid
Try --help.
```
instead of a bunch of output which is printed as if we passes the correct option as the argument of the `--mode` option.
---
it's an improvement of developer experience, hence no need to backport.
Closesscylladb/scylladb#20140
* github.com:scylladb/scylladb:
test/perf/perf_sstable: use switch-case when appropriate
test/perf/perf_sstables: use test_modes as the type of its option
This change fixes#17237, fixes#5361 and fixes#5362 by passing the limit value down the call chain in cql3. A test is also added.
fixes#17237fixes#5361fixes#5362
The regression happened in 5.4 as we changed the way GROUP BY is processed in 432cb02 - to force aggregation when it is used. The LIMIT value was not passed to aggregations and thus we failed to adhere to it.
W want to backport this fix to 5.4 and 6.0 to have continuous correct results for the test case from #17237
This patch consists of 4 commits:
- fa4225ea0fac2057b7a9976f57dc06bcbd900cd4 - cql3: respect the user-defined page size in aggregate queries - a precondition for this patch to be implementable
- 8fbe69e74dca16ed8832d9a90489ca47ba271d0b - cql3/select_statement: simplify the get_limit function - the `do_get_limit()` function did a lot of legwork that should not be associated with it. This change makes it trivial and makes its callers do additional checks (for unset guards, or for an aggregate query)
- 162828194a2b88c22fbee335894ff045dcc943c9 - cql3: process LIMIT for GROUP BY queries - pass the limit value down the chain and make use of it. This is the actual fix to #17237
- b3dc6de6d6cda8f5c09b01463bb52f827a6a00b4 - test/cql-pytest: Add test for GROUP BY queries with LIMIT - tests
Closesscylladb/scylladb#18842
* github.com:scylladb/scylladb:
test/cql-pytest: Add test for GROUP BY queries with LIMIT
cql3: process LIMIT for GROUP BY queries
cql3/select_statement: simplify the get_limit function
cql3: respect the user-defined page size in aggregate queries
Drop half-reversed (legacy) format of query::partition_slice.
The select query builds a fully reversed (native) slice for reversed queries and use it together with a reversed
schema to construct query::read_command that is further propagated to the database.
A cluster feature is added to support nodes that still operate on half-reversed slices. When the feature is turned off:
- query::read_command is transformed (to have table schema and half-reversed slices) before sending to other nodes
- query::read_command is transformed (to have query schema (reversed) and reversed slices) after receiving it from other nodes
- Similarly, mutations are transformed. They are reversed before being sent to other nodes or after receiving them from other nodes.
Additional manual tests were performed to test a mixed-node cluster:
1. 3-node cluster with one node upgraded: reverse read queries performed on an old node
2. 3-node cluster with one node upgraded: reverse read queries performed on a new node
3. 3-node cluster with one node upgraded and all its sstable files deleted to trigger repair: reverse read queries performed on an old node
4. 3-node cluster with one node upgraded and all its sstable files deleted to trigger repair: reverse read queries performed on a new node
All reverse read queries above consists of:
- single-partition reverse reads with no clustering key restrictions, with single column restrictions and multi column restrictions both with and without paging turned on
- multi-partition reverse reads with range restrictions with optional partition limit and partial ordering
The exact same tests were also performed on a fully upgraded cluster.
Fixes https://github.com/scylladb/scylladb/issues/12557Closesscylladb/scylladb#18864
* github.com:scylladb/scylladb:
mutation_partition: drop reverse parameter in compact_for_query
clustering_key_filter: unify get_ranges and get_native_ranges
streamed_mutation_freezer: drop the reverse parameter
reverse-reads.md: Drop legacy reverse format information
Fix comments refering to half-reversed (legacy) slices
select_statement::do_execute: Add tracing informaction
query::trim_clustering_row_ranges_to: require reversed schema for native reversed ranges
query-request: Drop half_reverse_slice as it is no longer used anywhere
readers: Use reversed schema and native reversed slices
database: accept reversed schema for reversed queries
storage_proxy: Support reverse queries in native format
query_pagers: Replace _schema with _query_schema
query_pagers: Support reverse queries in native format
select_statement: Execute reversed query in native format
storage_proxy::remote: Add support for mixed-node clusters
mutation_query: Add reversed function to reverse reconcilable_result
query-request: Add reversed function to reverse read_command
features: add native_reverse_queries
kl::reader::make_reader: Unify interface with mx::reader::make_reader
config: drop reversed_reads_auto_bypass_cache
config: drop enable_optimized_reversed_reads
This commit updates the configuration for ScyllaDB documentation so that:
- 6.1 is the latest version.
- 6.1 is removed from the list of unstable versions.
It must be merged when ScyllaDB 6.1 is released.
No backport is required.
Closesscylladb/scylladb#20041
With conversion of tagged integers to untagged ones going away, replace
static_cast<uint64_t>(snap.idx)
with
snap.idx.value()
Furthermore, casting "preserve_log_entries" (of type "size_t") to
"uint64_t" is redundant (both "snap.idx" and "preserve_log_entries" carry
nonnegative values, and the mathematical difference is expected to be
nonnegative); remove the cast.
Finally, simplify the initialization syntax.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly untag index_t and term_t values in the following two contexts:
- when they are passed to CQL queries as int64_t,
- when they are default-constructed as fallbacks for int64_t fields
missing from CQL result sets.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "raft/server.cc":
- addition of tagged and untagged (both should be tagged)
- subscripting an array by tagged (should be untagged)
- comparing a size-like threshold against tagged (should be untagged)
- exposing tagged via gauges (should be untagged)
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "raft/fsm.cc":
- addition of tagged and untagged (both should be tagged)
- comparison (relop) between tagged an untagged (both should be tagged)
- subscripting or sizing an array by tagged (should be untagged)
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in raft/log.{cc,h}:
- addition of tagged and untagged (both should be tagged)
- comparison (relop) between tagged an untagged (both should be tagged)
- subscripting an array, or offsetting an iterator, by tagged (should be
untagged)
- comparing an array bound against tagged (should be untagged)
- subtracting tagged from an array bound (should be untagged)
Note: these files mix uniform initialization syntax (index_t{...}) with
constructor call syntax (index_t()), with the former being more frequent.
Stick with the former here too, for consistency.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Internally, increment_and_get_generation() produces a
"gms::generation_type" value.
In turn, all callers of increment_and_get_generation() -- namely
scylla_main() [main.cc] and single_node_cql_env::run_in_thread()
[test/lib/cql_test_env.cc] -- pass the resolved value to
storage_service::init_address_map() and storage_service::join_cluster(),
both of which take a "gms::generation_type".
Therefore it is pointless to "untag" the generation value temporarily
between the producer and the consumers. Correct the return type of
increment_and_get_generation().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The callers of gossiper::compare_endpoint_startup() need not (should not)
learn of any particular (tagged or untagged) difference of generations;
they only care about the ordering of generations. Change the return type
of compare_endpoint_startup() to "std::strong_ordering", and delegate the
comparison to tagged_tagged_integer::operator<=>.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In do_sort(), we need to drop to "int32_t" temporarily, so that we can
call ::abs() on the version difference. Do that explicitly.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
If system_keyspace::stop() is called before system_keyspace::shutdown(),
it will never finish, because the uncleared shared pointers will keep
it alive indefinitely.
Currently this can happen if an exception is thrown before the construction
of the shutdown() defer. This patch moves the shutdown() call to immediately
before stop(). I see no reason why it should be elsewhere.
Fixesscylladb/scylla-enterprise#4380Closesscylladb/scylladb#20089
instead of using a chain of `if-else`, use switch-case instead,
it's visually easier to follow than `if`-`else` blocks. and since
we never need to handle the `else` case, the `throw` statement
is removed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we look up for the mode using the command line
option as the key, but that's incorrect if the command line option does
not match with any of the known names. in that case, `test_mode` just
create another pair of <sstring, test_modes>, and return the second
component of this pair. and the second component is not what we expect.
we should have thrown an exception.
in this change
* the test_mode map is marked const.
* the overloads for parsing / formatting the `test_modes` type are
added, so that boost::program_options can parse and format it.
after this change,
* we can print more user friendly error, like
```
/scylla perf-sstable --mode index-foo
error: the argument ('index-foo') for option '--mode' is invalid
Try --help.
```
instead of a bunch of output which is printed as if we passes the
correct option as the argument of the `--mode` option.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required to execute this req,
2. global topo req is executed by topo coordinator, which reads data attached to the req.
The KS name is among the data attached to the req. There's a time window between these steps where a to-be-altered KS could have been DROPped, which results in topo coordinator forever trying to ALTER a non-existing KS. In order to avoid it, the code has been changed to first check if a to-be-altered KS exists, and if it's not the case, it doesn't perform any schema/tablets mutations, but just removes the global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected changes, which is due to the fact that the code is written badly and needs to be refactored - an effort that's already planned under #19126
(I suggest to disable displaying whitespace differences when reviewing this PR).
Fixes: scylladb/scylladb#19576Closesscylladb/scylladb#19666
* github.com:scylladb/scylladb:
tests: ensure ALTER tablets KS doesn't crash if KS doesn't exist
cql: refactor rf_change indentation
Prevent ALTERing non-existing KS with tablets
Using the error injection framework, we inject a sleep into the
processing path of ALTER tablets KS, so that the topology coordinator of
the leader node
sleeps after the rf_change event has been scheduled, but before it is
started to be executed. During that time the second node executes a DROP
KS statement, which is propagated to the leader node. Once leader node
wakes up and resumes processing of ALTER tablets KS, the KS won't exist
and the node cannot crash, which was the case before.
The lister resembles the directory_lister from util -- it returns
entries upon its .get() invocation, and should be .close()d at the end.
Internally the lister issues ListObjectsV2 request with provided prefix
and limits the server with the amount of entries returned not to consume
too much local memory (we don't have streaming XML parser for response).
If the result is indeed truncated, the subsequent calls include the
continuation token as per [1]
[1] https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method logic is clean and simple -- load sstable from the descriptor and sort it into one of collections (local, shared, remote, unsorted). To achieve that there's a bunch of helper methods, but they duplicate functionality of each other. Squashing most of this code into process_descriptor() makes it easier to read and keeps sstable_directory private API much shorter.
Closesscylladb/scylladb#20126
* github.com:scylladb/scylladb:
sstable_directory: Open-code load_sstable() into process_descriptor()
sstable_directory: Squash sort_sstable() with process_descriptor()
sstable_directory: Remove unused sstable_filename(desc) helper
sstable_directory: Log sst->get_filename(), not sstable_filename(desc)
sstable_directory: Keep loaded sst in local var
sstable_directory: Remove unused helpers
sstable_directory: Load sstable once when sorting
There are two load_sstable() overloads, and one of them is only used
inside process_descriptor(). What this loading helper does is, in fact,
processes given descriptor, so it's worth having it open-coded into its
caller.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The latter (caller) loads sstable, so does the former, so load it once
and then put it in either list/set, depending on flags and shard info.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are some places that log sstable Data file name via sstable
descriptor. After previous patching all those loggers have sstable at
hand and can use sstable::get_filename() instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to decide which list to put sstable into, the sort_sstable()
first calls get_shards_for_this_sstable() which loads the sstable
anyway. If loaded shards contain only the current one (which is the
common case) sstable is loaded again. In fact, if the sstable happens to
be remote it's loaded anyway to get its open info.
Fix that by loading sstable, then getting shards directly from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reverse parameter is no longer used with native reverse reads.
The row ranges are provided in native reverse order together with
a reversed schema, thus the reverse parameter remain false all the
time and can be droped.
When a reverse slice is provided, it is given in the native reverse
format. Thus the ranges will be returned in the same order as stored
in the slice.
Therefore there is no need to distinguish between get_ranges and
get_native_ranges. The latter one gets dropped and get_ranges returns
ranges in the same order as stored in the slice.
The reverse parameter is no longer used with native reverse reads.
A reversed schema is provided and thus the reverse parameter shall
remain false all the time.
Simplify implementation and for clustering key ranges in native
reversed format, require a reversed table schema.
Trimming native reversed clustering key ranges requires a reversed
schema to be passed in. Thus, the reverse flag is no longer required
as it would always be set to false.
The reconcilable_result is built as it would be constructed for
forward read queries for tables with reversed order.
Mutations constructed for reversed queries are consumed forward.
Drop overloaded reversed functions that reverse read_command and
reconcilable_result directly and keep only those requiring smart
pointers. They are not used any more.
Remove schema reversing in query() and query_mutations() methods.
Instead, a reversed schema shall be passed for reversed queries.
Rename a schema variable from s into query_schema for readability.
For reversed queries, query_result() method accepts a reversed table
schema and read_command with a query schema version and a slice in
native reversed format.
Support mixed-node clusters. In such a case, the feature flag
native_reverse_queries is disabled and the read_command in sent
to replicas in the old regacy format (stores table schema version
and a slice in the legacy reverse format).
After the reconciliation, for the read+repair case, un-reversed
mutations are sent to replicas, i.e. forward ones.
Use a reversed schema and a native reversed slice when constructing
a read_command and executing a reversed select statement.
Such a created read_command is passed further down to query_pagers::pager
and storage::proxy::query_result that transform it to the format
they accept/know, i.e. lagacy.
In handle_read, detect whether a coming read_command is in the
legacy reversed format or native reversed format.
The result will be used to transform the read_command between format
as well as to transforms the results before they are send back to
the coordinator.
The reconcilable_result is reversed by reversing mutations for all
paritions it holds. Reversing is asynchronous to avoid potential
stall.
Use for transitions between legacy and native formats and in order
to support mixed-nodes clusters.
The read_command is reversed by reversing the schema version it
holds and transforming a slice from the legacy reversed format to
the native reversed format.
Use for trasition between format and to support mixed-nodes clusters
Reverse reads have already been with us for a while, thus this back
door option to read entire paritions forward and reversing them after
can be retired.
When signing AWS query one need to prepare "query string" which is a
line looking like `encode(query_param)=encode(query_value)&...`. Encoded
are only the query parameter names and values. It was missing in current
code and so far worked because no encodable characters were used.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Consider the following:
T
0 split prepare starts
1 repair starts
2 split prepare finishes
3 repair adds unsplit sstables
4 repair ends
5 split executes
If repair produces sstable after split prepare phase, the replica
will not split that sstable later, as prepare phase is considered
completed already. That causes split execution to fail as replicas
weren't really prepared. This also can be triggered with
load-and-stream which shares the same write (consumer) path.
The approach to fix this is the same employed to prevent a race
between split and migration. If migration happens during prepare
phase, it can happen source misses the split request, but the
tablet will still be split on the destination (if needed).
Similarly, the repair writer becomes responsible for splitting
the data if underlying table is in split mode. That's implemented
in replica::table for correctness, so if node crashes, the new
sstable missing split is still split before added to the set.
Fixes#19378.
Fixes#19416.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In order to fix the race between split and repair, we must introduce
the ability to split an "offline" sstable, one that wasn't added
to any of the table's sstable set yet.
It's not safe to split a sstable after adding it to the set, because
a failure to split can result in unsplit data left in the set, causing
split to fail down the road, since the coordinator thinks this replica
has only split data in the set.
Refs #19378.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
All lambdas passed to test_using_reusable_sst() conform to the prototype
void (test_env&, sstable_ptr)
All lambdas passed to test_using_reusable_sst_returning() conform to the
prototype
NON_VOID (test_env&, sstable_ptr)
The common parameter list of both prototypes can be expressed with the
concept
std::invocable<test_env&, sstable_ptr>
Once a "Func" template parameter (i.e., function type) satisfying this
concept is taken, then "Func"'s void or non-void return type can be
commonly expressed with
std::invoke_result_t<Func, test_env&, sstable_ptr>
In turn, test_env::do_with_async_returning<...> can be instantiated with
this return type, even if it happens to be "void".
([stmt.return] specifies, "[a] return statement with an operand of type
void shall be used only in a function that has a cv void return type",
meaning that
return func(env)
will do the right thing in the body of
test_env::do_with_async_returning<void>().)
Merge test_using_reusable_sst() and test_using_reusable_sst_returning()
into one. Preserve the function name from the former, and the
test_env::do_with_async_returning<...>() call from the latter.
Suggested-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Closesscylladb/scylladb#20090
there are chances that developer launch `pytest` right under
`test/nodetool`, in that case current working directory is not
the root directory of the project, so the path to suppression rules
does not point to a file.
to cater the needs to run the test under `test/nodetool`, let's
use the path deduced from the top_srcdir.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Unit testing for the SSTable validation API happens in
`sstable_validate_test`. Currently, this test checks the API against
some invalid SSTables with out-of-order clustering rows and out-of-order
partitions. However, both are types of content-level corruption that do
not trigger `malformed_sstable_exception` errors.
Extend the test to cover cases of file-level corruption as well, i.e.,
cases that would raise a `malformed_sstable_exception`. Construct an
SSTable with an invalid checksum to trigger this.
This is part of the effort to improve scrub to handle all kinds of
corruption.
Fixesscylladb/scylladb#19057
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#20096
`maybe_rehash` is complimentary and is not strictly require to succeed. If it fails, it will retry on the next call, but there's no reason to throw an exception that will fail its caller, since `maybe_rehash` is called as the final step after the caller has already succeeded with its action.
Minor enhancement for the error path, no backport required.
Closesscylladb/scylladb#19910
* github.com:scylladb/scylladb:
cell_locker: maybe_rehash: reindent
cell_locker: maybe_rehash: ignore allocation failures
Currently, each change to tablet metadata triggers a full metadata reload from disk. This is very wasteful, especially if the metadata change affects only a single row in the `system.tablets` table. This is the case when the tablet load balancer triggers a migration, this will affect a single row in the table, but today will trigger a full reload.
We expect tablet count to potentially grow to thousands and beyond and the overhead of this full reload can become significant.
This PR makes tablet metadata reload partial, instead of reloading all metadata on topology or schema changes, reload only the partitions that are affected by the change. Copy the rest from the in-memory state.
This is done with two passes: first the change mutations are scanned and a hint is produced. This hint is then passed down to the reload code, which will use it to only reload parts (rows/partitions) of the metadata that has actually changed.
The performance difference between full reload and partial reload is quite drastic:
```
INFO 2024-07-25 05:06:27,347 [shard 0:stat] testlog - Tablet metadata reload:
full 616.39ms
partial 0.18ms
```
This was measured with the modified (by this PR) `perf_tablets`, which creates 100 tables, each with 2K tablets. The test was modified to change a single tablet, then do a full and partial reload respectively, measuring the time it takes for reach.
Fixes: #15294
New feature, no backport needed.
Closesscylladb/scylladb#15541
* github.com:scylladb/scylladb:
test/perf/perf_tablets: add tablet metadata reload perf measurement
test/boost/tablets_test: add test for partial tablet metadata updates
db/schema_tables: pass tablet hint to update_tablet_metadata()
service/storage_service: load_tablet_metadata(): add hint parameter
service/migration_listener: update_tablet_metadata(): add hint parameter
service/raft/group0_state_machine: provide tablet change hint on topology change
service/storage_service: topology_state_load(): allow providing change hint
replica/tablets: add update_tablet_metadata()
replica/tablets: fix indentation
replica/tablets: extract tablet_metadata builder logic
replica/tablets: add get_tablet_metadata_change_hint() and update_tablet_metadata_change_hint()
locator/tablets: add tablet_map::clear_tablet_transition_info()
locator/tablets: make tablet_metadata cheap to copy
mutation/canonical_mutation: add key()
Measure reload perf of full reload vs. partial reload, after changing a
single tablet.
While at it, modify the `--tablets-per-table` parameter, so that it has
a default parameter which works OOTB. The previous default was both too
large (causing oversized commitlog entry errors) and not a power of two.
Replace the has_tablet_mutations in `merge_tables_and_views()` with a
hint parameter, which is calculated in the caller, from the original
schema change mutations. This hint is then forwarded to the notifier's
`update_tablet_metadata()` so that subscribers can refresh only the
tablet partitions that changed.
The hint contains information related to what exactly changed, allowing
listeners to do partial updates, instead of reloading all metadata on
each notification.
Extract a hint of what a tablet mutation changed. The hint can be later
used to selectively reload only the changed parts from disk.
Two variants are added:
* get_tablet_metadata_change_hint() - extracts a hint from a list of
tablet mutations
* update_tablet_metadata_change_hint() - updates an existing hint based
on a single mutation, allowing for incremental hint extraction
Keep lw_shared_ptr<tablet_map> in the tablet map and use COW semantics.
To prevent accidental changes to shared tablet_map instances, all
modifications to a tablet_map have to go through a new
`mutate_tablet_map()` method, which implements the copy-modify-swap
idiom.
Fixes#19960
Write path for sstables/commitlog need to handle the fact that IO extensions can
generate errors, some of which should be considered retry-able, and some that should,
similar to system IO errors, cause the node to go into isolate mode.
One option would of course be for extensions to simply generate std::system_errors,
with system_category and appropriate codes. But this is probably a bad idea, since
it makes it more muddy at which level an error happened, as well as limits the
expressibility of the error.
This adds three distinct types (sharing base) distinguishing permission, availabilty
and configuration errors. These are treated akin to EACCESS, ENOENT and EINVAL in
disk error handler and memtable write loop.
Tests updated to use and verify behaviour.
Closesscylladb/scylladb#19961
After removal of rwlock (53a6ec05ed), the race was introduced because the order that
compaction groups of a tablet are closed, is no longer deterministic.
Some background first:
Split compaction runs in main (unsplit) group, and adds sstable to left and right groups
on completion.
The race works as follow:
1) split compaction starts on main group of tablet X
2) tablet X reaches cleanup stage, so its compaction groups are closed in parallel
3) left or right group are closed before main (more likely when only main has flush work to do)
4) split compaction completes, and adds sstable to left and right
5) if e.g left is closed, adjusting backlog tracker will trigger an exception, and since that
happens in row cache update's execute(), node crashes.
The problem manifested as follow:
[shard 0: gms] raft_topology - Initiating tablet cleanup of 5739b9b0-49d4-11ef-828f-770894013415:15 on 102a904a-0b15-4661-ba3f-f9085a5ad03c:0
...
[shard 0:strm] compaction - [Split keyspace1.standard1 009e2f80-49e5-11ef-85e3-7161200fb137] Splitting [/var/lib/scylla/data/keyspace1/...]
...
[shard 0:strm] cache - Fatal error during cache update: std::out_of_range (Compaction state for table [0x600007772740] not found),
at: ...
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(...
--------
seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> >
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::(anonymous namespace)::thread_wake_task
--------
seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::async<sstables::compaction::run(...
seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::future<sstables::compaction_resu...
From the log above, it can be seen cache update failure happens under streaming sched group and
during compaction completion, which was good evidence to the cause.
Problem was reproduced locally with the help of tablet shuffling.
Fixes: #19873.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#19987
If parent_info argument of compaction_manager::perform_compaction
is std::nullopt, then created compaction executor isn't tracked by task
manager. Currently, all compaction operations should by visible in task
manager.
Modify split methods to keep split executor in task manager. Get rid of
the option to bypass task manager.
Closesscylladb/scylladb#19995
* github.com:scylladb/scylladb:
compaction: replace optional<task_info> with task_info param
compaction: keep split executor in task manager
There are some bugs missed in task handler:
- wait_for_task does not wait until virtual tasks are done, but returns the status immediately;
- wait_for_task suffers from use after return;
- get_status_recursively does not set the kind of task essentials.
Fix the aforementioned.
Closesscylladb/scylladb#19930
* github.com:scylladb/scylladb:
test: add test to check that task handler is fixed
tasks: fix task handler
Remove xfail from all tests for #5361, as the issue is fixed.
Remove xfail from test_group_by_clustering_prefix_with_limit
It references #5362, but is fixed by #17237.
Refs #17237
Currently LIMIT not passed to the query executor at all and it was just
an accident that it worked for the case referenced in #17237. This
change passes the limit value down the chain.
The get_limit() function performed tasks outside of its scope - for
example checked if the statement was an aggregate. This change moves the
onus of the check to the caller.
The comment in the code already states that we should use the
user-defined page size if it's provided. To avoid OOM conditions we'll
use the internally defined limit as the upper bound or if no page size
is provided.
This change lays ground work for fixing #5362 and is necessary to pass
the test introduced in #19392 once it is implemented.
This patch adds `suppress_features` error injection. It allows to revoke
support for some features and it can be used to simulate upgrade process
in test.py.
Features to suppress are passed as injection's value, separated by `;`.
Example: `PARALLELIZED_AGGREGATION;UDA_NATIVE_PARALLELIZED_AGGREGATION`
Fixesscylladb/scylladb#20034Closesscylladb/scylladb#20055
before this change, we use the default options for
performing read on the input. and the default options
is like
```c++
struct file_input_stream_options {
size_t buffer_size = 8192; ///< I/O buffer size
unsigned read_ahead = 0; ///< Maximum number of extra read-ahead operations
};
```
which is not able to offer good throughput when
reading from disk, when we stream to S3.
so, in this change, we use options which allows better throughput.
Refs 061def001d
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20074
Before these changes, we didn't specify which I/O scheduling
group commitlog instances in hinted handoff should use.
In this commit, we set it explicitly to the commitlog
scheduling group. The rationale for this choice is the fact
we don't want to cause a bottleneck on the write path
-- if hints are written too slowly, new incoming mutations
(NOT hints) might be rejected due to a too high number
of hints currently being written to disk; see
`storage_proxy::create_write_response_handler_helper()`
for more context.
Fixesscylladb/scylladb#18654Closesscylladb/scylladb#19170
This patch makes all cql connections update theirs service level parameters automatically when:
- any service level is created or changed
- one role is granted to another
- any service level is attached to/detached from a role
First of all, the patch defines what a service level and an effective service level are 938aa10509. No new type of service levels are introduced, the commit only clarifies definitions and names what an effective service level is.
(Effective service level is created by merging all service levels which are attached to all roles granted to the user. It represents exact values of connection's parameters.)
Previously, to find an effective service level of a user, it required O(n) internal queries: O(n) queries to recursively find all granted roles (`standard_role_manager::query_granted()`) and a query for each role to get its service level (`standard_role_manager::get_attribute()`, which sums to O(n) queries).
Because we want to reload SL parameters for all opened cql connections, we don't want to do O(n) queries for every connection, every time we create or change any service level/grant one role to another/attach or detach a service level to/from a role.
To speed it up, the patch adds another layer of service level controller cache, which stored `role_name -> effective_service_level` mapping. This way finding a effective service level for a role is only a lookup to a map.
Building the new cache requires only 2 queries: one to obtain all role hierarchy one to get all roles' service level.
Fixesscylladb/scylladb#12923Closesscylladb/scylladb#19085
* github.com:scylladb/scylladb:
test/auth_cluster/test_raft_service_levels: add test for automatic connection update
api/cql_server_test: add CQL server testing API
transport/cql_server: subscribe to sl effective cache reloaded
transport/controller: coroutinize `subscribe_server` and `unsubscribe_server`
transport/cql_server: add method to update service level params on all connections
generic_server: use async function in `for_each_gently()`
service/qos/sl_controller: use effective service levels cache
service/qos/service_level_controller: notify subscribers on effective cache reloaded
service/raft/group0_state_machine: update effective service levels cache
service/topology_coordinator: migrate service levels before auth
service/qos/service_level_controller: effective service levels cache
utils/sorting: allow to pass any container as verticies
service/qos/service_level_controller: replace shard check to assert
service/qos: define effective service level
service/qos/qos_common: use const reference in `init_effective_names()`
service/qos/service_level_controller: remove unused field
auth: return map of directly granted roles
test/auth/test_auth_v2_migration: create sl1 in the test
We have two mechanism to give visibility into reads having to process many tombstones:
* a warning in the logs, triggered if a read processed more the `tombstone_warn_threshold` dead rows/tombstones
* a trace message, which includes stats of the amount of rows in the page, including the amount of live and dead rows as well as tombstones
This series extends this to also include information on cells, so we have visibility into the case where a read has to process an excessive amount of cell tombstones (mainly because of collections).
A log line is now also logged if the amount of dead cells/tombstones in the page exceeds `tombstone_warn_threshold`. The trace message is also extended to contain cell stats.
The `tombstone_warn_threshold` log lines now receive a 10s rate-limit to avoid excessive log spamming. The rate-limit is separate for the row and cell logs.
Example of the new log line (`tombstone_warn_threshold=10` ):
```
WARN 2024-05-30 07:56:44,979 [shard 0:stmt] querier - Read 98 live cells and 126 dead cells/tombstones for system_schema.scylla_tables <partition-range-scan> (-inf, +inf) (see tombstone_warn_threshold)
```
Example of the new tracing message:
```
Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 13 cell(s) (1 live, 12 dead) [shard 0] | 2024-05-30 08:13:19.690803 | 127.0.0.1 | 6114 | 127.0.0.1
```
Fixes: https://github.com/scylladb/scylladb/issues/18996
Improvement, not a backport candidate.
Closesscylladb/scylladb#18997
* github.com:scylladb/scylladb:
test/boost: mutation_test: add test for cell compaction stats
mutation/compact_and_expire_result: drop operator bool()
querier: consume_page(): add rate-limiting to tombstone warnings
querier: consume_page(): add cell stats to page stats trace message
querier: consume_page(): add tombstone warning for cell tombstones
querier: consume_page(): extract code which logs tombstone warning
mutation/mutation_compactor: collect and aggregate cell compaction stats
mutation: row::compact_and_expire(): use compact_and_expire_result
collection_mutation: compact_and_expire(): use compact_and_expire_result
mutation: introduce compact_and_expire_result
Fixes#13334
All required code paths (see enterprise) now uses
extensions::is_extension_internal_keyspace.
The old mechanism can be removed. One less global var.
Closesscylladb/scylladb#20047
This is a followup to #19937, for #19803. See in particular [this comment](https://github.com/scylladb/scylladb/issues/19803#issuecomment-2258371923).
The primary conversion target is coroutines. However, while coroutines are the most convenient style, they are only infrequently usable in this case, for the following reasons:
- Wherever we have a `future::finally()` that calls a cleanup function that returns a future (which must be awaited), we cannot use `co_await`. We can only use `seastar::async()` with `deferred_close` or `defer()`.
- The code passes lots of lambdas, and `co_await` cannot be used in lambdas. First, I tried, and the compiler rejects it; second, a capturing lambda that is a coroutine is a trap [[1]](https://devblogs.microsoft.com/oldnewthing/20211103-00/?p=105870) [[2]](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rcoro-capture).
In most cases, I didn't have to use naked `seastar::async()`; there were specialized wrappers in place already. Thus, most of the changes target `seastar::thread` context under existent `seastar::async()` wrappers, and only a few functions end up as coroutines.
The last patch in the series (`test/sstable: remove useless variable from promoted_index_read()`) is an independent micro-cleanup, the opportunity for which I thought to have noticed while reading the code.
The tail of `test/boost/sstable_test.cc` (the stuff following `promoted_index_read()`) is already written as `seastar::thread`. That's already better (for readability) than future chaining; but could have I perhaps further converted those functions to coroutines? My answer was "no":
- Some of the candidate functions relied on deferred cleanups that might need to yield (all three variants of `count_rows()`).
- Some had been implemented by passing lambdas to wrappers of `seastar::async()` (`sub_partition_read()`, `sub_partitions_read()`).
- The test case `test_skipping_in_compressed_stream()` initially looked promising for co-routinization (from its starting point `seastar::async()`), because it seemed to employ no deferred cleanup (that might need to yield). However, the function uses three lambdas that must be able to yield internally, and one of those (`make_is()`) is even capturing.
- The rest (`test_empty_key_view_comparison()`, `test_parse_path_good()`, `test_parse_path_bad()`) was synchronous code to begin with.
```
test/boost/sstable_test.cc | 188 +++++++++-----------
1 file changed, 83 insertions(+), 105 deletions(-)
```
Refactoring; no backport needed.
Closesscylladb/scylladb#20011
* github.com:scylladb/scylladb:
test/sstable: remove useless variable from promoted_index_read()
test/sstable: rewrite promoted_index_read() with async()
test/sstable: unfuturize lambda invocation in test_using_reusable_sst*()
test/sstable: rewrite wrong_range() with async()
test/sstable: simplify not_find_key_composite_bucket0() under test_using_reusable_sst()
test/sstable: rewrite full_index_search() with async()
test/sstable: simplify find_key*(), all_in_place() under test_using_reusable_sst()
test/sstable: rewrite (un)compressed_random_access_read() with async()
test/sstable: simplify write_and_validate_sst()
test/sstable: simplify check_toc_func() under async()
test/sstable: simplify check_statistics_func() under async()
test/sstable: simplify check_summary_func() under async()
test/sstable: coroutinize check_component_integrity()
test/sstable: rewrite write_sst_info() with async()
test/sstable: simplify missing_summary_first_last_sane()
test/sstable: coroutinize summary_query_fail()
test/sstable: rewrite summary_query() with async()
test/sstable: coroutinize (simple/composite)_index_read()
test/sstable: rewrite index_read() with async()
test/sstable: rewrite test_using_reusable_sst() with async()
test/sstable: rewrite test_using_working_sst() with async()
Add a CQL server testing API with and endpoint to dump
service level parameters of all CQL connections.
This endpoint will be later used to test functionality of
automated updating CQL connections parameters.
Make cql server (but not maintenance server) is subscribed to qos
configuration change.
Trigger update of connections' service level params on effective cache
reloaded event.
It's not done on maintenance server because it doesn't support role
hierarchy nor attaching service levels.
connections
Trigger update of service level param on every cql connection.
In enterprise, the method needs also to update connections' scheduling
group.
In the following patch, we will add a method to update service levels
parameters for each cql connections.
To support this, this patch allows to pass async function as a parameter
to `for_each_gently()` method.
Updates to `system.role_members` and `system.role_attributes` affect
effective service levels cache, so applying mutations to those tables
should reload the effective SL cache.
Effective service level cache will be updated when mutations are applied to
some of the auth tables.
But the effective cache depends on first-level service levels cache, so
service levels data should be migrated before auth data.
Add a second layer of service_level_controller cache which contains
role name -> effective service level mapping.
To build the mapping, controller uses first cache layer (service level
name -> service level) and 2 queries to auth tables (one to `roles` and
one to `role_members`).
The container containing all verticies doesn't have to be a vector.
Allowing to pass any container that meet conditions, will make to
function more flexible.
Write down definitions of `service level` and `effective service level`
in service/qos/service_level_controller.hh.
Until now, effective service level was only used as result of
`LIST EFFECTIVE SERVICE LEVEL OF <role>`.
Now we want to have quick access to effective service level of
each role and introduce cache of effective sl to do it.
New definitions clarify things.
The commit also renames:
- `update_service_levels_from_distributed_data` -> `update_service_levels_cache`
Later we will introduce effective_service_level_cache, so this change
standarizes the names.
- `find_service_level` -> `find_effective_service_level`
The function actualy returns effective service level.
Returns multimap of directly granted roles for each role. Uses
only one query to create the map, instead of doing recursive queries
for each individual role.
Test `test_auth_v2_migration` creates auth data where role `users`
has assigned service level `sl:fefe` but the service level isn't
actually created.
In following patches, we are going to introduce effective service levels
cache which depends on auth and is refreshed when mutations are applied
to v2 auth tables.
Without this changes, this test will fail because the service level
doesn't exist.
Also the name `sl:fefe` is change to `sl1`.
It runs in the background and consists of two parts -- async() lambda and following .then()-s. This PR move the background running code into its own method and coroutinizes it in parts. With #19954 merged it finally looks really nice.
Closesscylladb/scylladb#20058
* github.com:scylladb/scylladb:
view_builder: Restore indentation after previous patches
view_builder: Coroutinize inner start_in_background() calls
view_builder: Coroutinize outer start_in_background() calls
view_builder: Add helper method for background start
Currently we print an ERROR on all exceptions in
`raft_topology_cmd_handler`. This log level is too high, in some cases
exceptions are expected -- like during shutdown. And it causes dtest
failures.
Turn exceptions from aborts into WARN level.
Also improve logging by printing the command that failed.
Fixesscylladb/scylladb#19754Closesscylladb/scylladb#19935
If tablet-based table is created concurrently with node being
decommissioned after tablets are already drained, the new table may be
permanently left with replicas on the node which is no longer in the
topology. That creates an immidiate availability risk because we are
running with one replica down.
This also violates invariants about replica placement and this state
cannot be fixed by topology operations.
One effect is that this will lead to load balancer failure which will
inhibit progress of any topology operations:
load_balancer - Replica 154b0380-1dd2-11b2-9fdd-7156aa720e1a:0 of tablet 7e03dd40-537b-11ef-9fdd-7156aa720e1a:1 not found in topology, at: ...
Fixes#20032Closesscylladb/scylladb#20053
Sync points are created, via POST HTTP requests, for a subset of nodes
in the cluster. Those nodes are specified in a request's parameter
`target_hosts`. When the parameter is empty, Scylla should assume
the user wants to create a sync point for ALL nodes.
Before these changes, sync points were created only for LIVE nodes.
If a node was dead but still part of the cluster and the user
requested creating a sync point leaving the parameter `target_hosts`
empty, the dead node was skipped during the creation of the sync point.
That was inconsistent with the guarantees the sync point API provides.
In this commit, we fix that issue and add a test verifying that
the changes have made the implementation compliant with the design
of the sync point API -- the test only passes after this commit.
Fixesscylladb/scylladb#9413Closesscylladb/scylladb#19750
One of the co_await-ed parts of this method is async() lambda. It can be
coroutinized too. One thing to care is the semaphore units -- its scope
should (?) terminate earlier than the whole start_in_background() so
release it explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method consists of two parts -- one running in async() thread and
continuations to it. This patch turns the latter chain into co_await-s.
The mentioned chain is "guarded" by then_wrapped() catch of any
exception, which is turned into a plain try-catch block.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The view_builder::start() happens in the background. It's good to have
explicit start_in_background() method and coroutinize it next.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In this commit, we describe the mechanism of sync point
in Hinted Handoff in the user documentation. We explain
the motivation for it and how to use it, as well as list
and describe all of the parameters involved in the process.
Errors that may appear and experienced by the user
are addressed in the article.
Fixesscylladb/scylladb#18500Closesscylladb/scylladb#19686
The get_all_data_file_locations and get_saved_caches_location get the
returned data from db::config and should be next other endpoints working
with config data.
refs: #2737
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19958
When starting, view builder wants all shards to synchronize with each other in the middle of initialization. For that they all synchronize via shard-0's instance counter and a shared future. There's cross-shard barrier in utils/ that provides the same facility.
Closesscylladb/scylladb#19954
* github.com:scylladb/scylladb:
view_builder: Drop unused members
view_builder: Use cross-shard barrier on start
view_builder: Add cross-shard barrier to its .start() method
Having an operator bool() on this struct is counter-intuitive, so this
commit drops it and migrates any remaining users to bool is_live().
The purpose of this operator bool() was to help in incrementally replace
the previous bool return type with compact_and_expire_result in the
compact_and_expire() call stack. Now that this is done, it has served
its purpose.
Soon, we want to log a warning on too many cell tombstones as well.
Extract the logging code to allow reuse between the row and cell
tombstone warnings.
There are some bugs missed in task handler:
- wait_for_task does not wait until virtual tasks are done, but
returns the status immediately;
- wait_for_task suffers from use after return;
- get_status_recursively does not set the kind of task essentials.
Fix the aforementioned.
This commit extracts the information about the configuration the user should do
right after installation (especially running scylla_setup) to a separate file.
The file is included in the relevant pages, i.e., installing with packages
and installing with Web Installer.
In addition, the examples on the Web Installer page are updated
with supported versions of ScyllaDB.
Fixes https://github.com/scylladb/scylladb/issues/19908Closesscylladb/scylladb#20035
Add more logging for raft-based topology operations in INFO and DEBUG
levels.
Improve the existing logging, adding more details.
Fix a FIXME in test_coordinator_queue_management (by readding a log
message that was removed in the past -- probably by accident -- and
properly awaiting for it to appear in test).
Enable group0_state_machine logging at TRACE level in tests. These logs
are relatively rare (group 0 commands are used for metadata operations)
and relatively small, mostly consist of printing `system.group0_history`
mutation in the applied command, for example:
```
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - apply() is called with 1 commands
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd: prev_state_id: optional(dd9d47c6-50ee-11ef-d77f-500b8e1edde3), new_state_id: dd9ea5c6-50ee-11ef-ae64-dfbcd08d72c3, creator_addr: 127.219.233.1, creator_id: 02679305-b9d1-41ef-866d-d69be156c981
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd.history_append: {canonical_mutation: table_id 027e42f5-683a-3ed7-b404-a0100762063c schema_version c9c345e1-428f-36e0-b7d5-9af5f985021e partition_key pk{0007686973746f7279} partition_tombstone {tombstone: none}, row tombstone {range_tombstone: start={position: clustered, ckp{0010b4ba65c64b6e11ef8080808080808080}, 1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=1722617232237511, deletion_time=1722617232}}{row {position: clustered, ckp{0010dd9ea5c650ee11efae64dfbcd08d72c3}, 0} tombstone {row_tombstone: none} marker {row_marker: 1722617232237511 0 0}, column description atomic_cell{ create system_distributed keyspace; create system_distributed_everywhere keyspace; create and update system_distributed(_everywhere) tables,ts=1722617232237511,expiry=-1,ttl=0}}}
```
note that the mutation contains a human-readable description of the
command -- like "create system_distributed keyspace" above.
These logs might help debugging various issues (e.g. when `apply` hangs
waiting for read_apply mutex, or takes too long to apply a command).
Ref: scylladb/scylladb#19105
Ref: scylladb/scylladb#19945Closesscylladb/scylladb#19998
This PR adds the 6.0-to-6.1 upgrade guide (including metrics) and removes the 5.4-to-6.0 upgrade guide.
Compared 5.4-to-6.0, the the 6.0-to-6.1 guide:
- Added the "Ensure Consistent Topology Changes Are Enabled" prerequisite.
- Removed the "After Upgrading Every Node" section. Both Raft-based schema changes and topology updates
are mandatory in 6.1 and don't require any user action after upgrading to 6.1.
- Removed the "Validate Raft Setup" section. Raft was enabled in all 6.0 clusters (for schema management),
so now there's no scenario that would require the user to follow the validation procedure.
- Removed the references to the Enable Consistent Topology Updates page (which was in version 6.0 and is removed with this PR) across the docs.
See the individual commits for more details.
Fixes https://github.com/scylladb/scylladb/issues/19853
Fixes https://github.com/scylladb/scylladb/issues/19933
This PR must be backported to branch-6.1 as it is critical in version 6.1.
Closesscylladb/scylladb#19983
* github.com:scylladb/scylladb:
doc: remove the 5.4-to-6.0 upgrade guide
doc: add the 6.0-to-6.1 upgrade guide
Currently, the resource utilization in CI is low. Increasing the number of clusters will increase how many tests are executed simultaneously. This will decrease the time it takes to execute and improve resource utilization.
Related: https://github.com/scylladb/qa-tasks/issues/1667Closesscylladb/scylladb#19832
The command has a singl check for the missing keyspace and/or table
parameters and if the check fails, there is a combined error message.
Apparently this is confusing, so split the check so that missing
keyspace and missing table args have its own check and error message.
Fixes: scylladb/scylladb#19984Closesscylladb/scylladb#20005
This commit removes the 5.4-to-6.0 upgrade guide and all references to it.
It mainly removes references to the Enable Consistent Topology Updates page,
which was added as enabling the feature was optional.
In rare cases, when a reference to that page is necessary,
the internal link is replaced with an external link to version 6.0.
Especially the Handling Cluster Membership Change Failures page was modified
for troubleshooting purposes rather than removed.
instead of reinventing the wheel, let's use the existing one.
in this change, we trade the `div_ceil()` implementated in s3/client.cc
for the existing one in utils/div_ceil.hh . because we are not using
`std::lldiv()` anymore, the corresponding `#include <cstdlib>` is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20000
instead of using `std::rethrow_exception()`, use
`coroutine::return_exception_ptr()` which is a little bit more
efficient.
See also 6cafd83e1c
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20001
fmt 11 enforces the constness of `format()` member function, if it
is not marked with `const`, the tree fails to build with fmt 11, like:
```
/usr/include/fmt/base.h:1393:23: error: no matching member function for call to 'format'
1393 | ctx.advance_to(cf.format(*static_cast<qualified_type*>(arg), ctx));
| ~~~^~~~~~
/usr/include/fmt/base.h:1374:21: note: in instantiation of function template specialization 'fmt::detail::value<fmt::context>::format_custom_arg<service::migration_badness, fmt::formatter<service::migration_badness>>' requested here
1374 | custom.format = format_custom_arg<
| ^
/home/kefu/dev/scylladb/service/tablet_allocator.cc:170:14: note: in instantiation of function template specialization 'fmt::format_to<fmt::basic_appender<char>, const locator::global_tablet_id &, const locator::tablet_replica &, const locator::tablet_replica &, const service::migration_badness &, 0>' requested here
170 | fmt::format_to(ctx.out(), "{{tablet: {}, {} -> {}, badness: {}", candidate.tablet, candidate.src,
| ^
/home/kefu/dev/scylladb/service/tablet_allocator.cc:161:10: note: candidate function template not viable: 'this' argument has type 'const fmt::formatter<service::migration_badness>', but method is not marked const
161 | auto format(const service::migration_badness& badness, FormatContext& ctx) {
| ^
```
so, in this change, we mark these two `format()` member functions const.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20013
rewrite the function as coroutine to make it easier to read and maintain, following lifetime issues we had and fixed in this function.
The second commit adds a test that drops a table while there is a counter update operation ongoing in the table.
The test reproduces issue https://github.com/scylladb/scylla-enterprise/issues/4475 and verifies it is fixed.
Follow-up to https://github.com/scylladb/scylladb/pull/19948
Doesn't require backport because the fix to the issue was already done and backported. This is just cleanup and a test.
Closesscylladb/scylladb#19982
* github.com:scylladb/scylladb:
db: test counter update while table is dropped
db: coroutinize do_apply_counter_update
Recently, some users have seen "Key size too large" errors in various
places. Cassandra and Scylla impose a 64KB length limit on keys, and
we have known about bugs in this area for a long time - and even had
some translated Cassandra unit tests that cover some of them. But these
tests did not cover all the corner cases and left us with partial and
fragmented knowledge of this problem, spread over many test files and
many issues.
In this patch, we add a single test file, test/cql-pytest/test_key_length.py
which attempts to rigourously explore the various bugs we have with
CQL key length limits. These test aim to reproduce all known bugs in
this area:
* Refs #3017 - CQL layer accepts set values too large to be written to
an sstable
* Refs #10366 - Enforce Key-length limits during SELECT
* Refs #12247 - Better error reporting for oversized keys during INSERT
* Refs #16772 - Key length should be limited to exactly 65535, not less
The following less interesting bug is already covered by many tests so
I decided not to test it again:
* Refs #7745 - Length of map keys and set items are incorrectly limited
to 64K in unprepared CQL
There's also a situation in materialized views and secondary indexes,
where a column that was _not_ a key, now becomes a key, and a length
limit needs to be enforced on it. We already have good test coverage
for this (in test/cql-pytest/test_secondary_index.py and in
test/cql-pytest/test_materialized_view.py), and we have an issue:
* Refs #8627 - Cleanly reject updates with indexed values where value > 64k
All 16 tests added here pass on Cassandra 5 except one that fails on
https://issues.apache.org/jira/browse/CASSANDRA-19270, but 11 of the
tests currently fail on Scylla (6 on #12247, 2 on #10366, 3 on #16772).
It is possible that our decision in #16772 will not be to fix Scylla
to match Cassandra but rather to declare that strict compatibility isn't
needed in this case or even that Cassandra is wrong. But even then,
having these tests which demonstrate the behavior of both Cassandra
and Scylla will be important.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#16779
When debugging coredumps some (small, but useful) information is hidden
in the notes of the core ELF file. Add some words about it exists, what
it includes and the thing that is always forgotten -- the way to get one
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19962
Commit ad0e6b79 (replica: Remove all_datadir from keyspace config) removed all_datadirs from keyspace config, now it's datadir turn. After this change keyspace no longer references any on-disk directories, only the sstables's storage driver attached to keyspace's tables does.
refs #12707Closesscylladb/scylladb#19866
* github.com:scylladb/scylladb:
replica: Remove keyspace::config::datadir
sstables/storage: Evaluate path for keyspace directory in storage
sstables/storage: Add sstables_manager arg to init_keyspace_storage()
This commit adds OS support for version 6.1 and removes OS support for 5.4
(according to our support policy for versions).
Closesscylladb/scylladb#19992
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006
The for_tests constructor has a metrics parameter defaulted to
register_metrics::no, but when delegating to the other constructor, a
hard-coded register_metrics::no is passed. This makes no difference
currently, because all callers use the default and the hard-coded value
corresponds to it. Let's fix it nevertheless to avoid any future
surprises.
Closesscylladb/scylladb#20007
The large_partition_schema() call returns a copy of the "schema_ptr"
object that points to an effectively statically initialized thread_local
"schema" object. The large_partition_schema() call has no bearing on
whether, or when, the "schema" object is constructed, and has no side
effects (other than copying an "lw_shared_ptr" object). Furthermore, the
return value of large_partition_schema() is not used for anything in
promoted_index_read().
This redundant call seems to date back to original commit 3dd079fb7a
("tests: add test for reading parts of a large partition", 2016-08-07).
Remove the call and the variable.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
All lambdas passed to test_using_reusable_sst() and
test_using_reusable_sst_returning() have been converted to future::get()
calls (according to the seastar::thread context that they are now executed
in). None of the lambdas return futures anymore; they all directly return
void or non-void. Therefore, drop futurize_invoke(...).get() around the
lambda invocations in test_using_reusable_sst*().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
For better readability, replace the future::then() chaining (and the
associated manual fiddling with object lifecycles) with future::get() (and
rely on seastar::thread's stack). We're already in seastar::thread
context.
Similarly, replace the future::finally() underlying with_closeable() with
deferred_close(); with the assumption that mutation_reader::close() never
fails (and is therefore safe to call in the "deferred_close" destructor).
This is actually guaranteed, as mutation_reader::close() is marked
"noexcept".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
According to early patch "test/sstable: rewrite test_using_reusable_sst()
with async" in this series, lambdas passed to test_using_reusable_sst()
are invoked:
(a) less importantly here, in seastar::thread context,
(b) more importantly here, futurized (temporarily so).
The test case not_find_key_composite_bucket0() doesn't chain futures;
therefore it needs no conversion to future::get() for purpose (a);
however, we can eliminate its empty future return. Fact (b) will cover for
that, until all such lambdas are converted to direct "void" returns (at
which point we can remove the futurization from
test_using_reusable_sst()).
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
For better readability, replace future::then() chaining with
future::get(). (We're already in seastar::thread context.)
This patch is best viewed with "git show -b".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
According to early patch "test/sstable: rewrite test_using_reusable_sst()
with async" in this series, lambdas passed to test_using_reusable_sst()
are invoked:
(a) less importantly here, in seastar::thread context,
(b) more importantly here, futurized (temporarily so).
The test cases find_key_map(), find_key_set(), find_key_list(),
find_key_composite(), all_in_place() don't chain futures; therefore they
need no conversion to future::get() for purpose (a); however, we can
eliminate their empty future returns. Fact (b) will cover for that, until
all such lambdas are converted to direct "void" returns (at which point we
can remove the futurization from test_using_reusable_sst()).
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
All three lambdas passed to write_and_validate_sst() now use future::get()
rather than future::then() chaining; in other words, the future::get()
calls inside all these seastar::thread contexts have been pushed down to
the lambdas. Change all these lambdas' return types from future<> to void.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The lambda passed to write_and_validate_sst() already runs in
seastar::thread context; replace future::then() chaining with
future::get() calls.
We're going to eliminate the trailing "return make_ready_future<>()"
later.
This patch is best viewed with "git show -W -b".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The lambda passed to write_and_validate_sst() already runs in
seastar::thread context; replace future::then() chaining with
future::get() calls.
We're going to eliminate the trailing "return make_ready_future<>()"
later.
This patch is best viewed with "git show -W -b".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The lambda passed to write_and_validate_sst() already runs in
seastar::thread context; replace future::then() chaining with
future::get() calls.
We're going to eliminate the trailing "return make_ready_future<>()"
later.
This patch is best viewed with "git show -W -b".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
check_component_integrity() does not rely on any deferred close or stop
operations; turn it into a coroutine therefore, for best readability.
This conversion demonstrates particularly well how much the stack eases
coding. We no longer need to artificially extend the lifetime of "tmp"
with a final
.then([tmp] {})
future. Consequently, "tmp" no longer needs to be a shared pointer to an
on-heap "tmpdir" object; "tmp" can just be a "tmpdir" object on the stack.
While at it, eliminate the single-use local objects "s" and "gen", for
movability's sake. (We could use std::move() on these variables, but it
seems easier to just flatten the function calls that produce the
corresponding rvalues into the write_sst_info() argument list.)
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The lambda passed to test_using_reusable_sst() is now invoked --
futurized, transitorily -- in seastar::thread context; stop returning an
explicit make_ready_future<>() from the lambda.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
summary_query_fail() does not rely on any deferred close or stop
operations; turn it into a coroutine therefore, for best readability.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
simple_index_read() and composite_index_read() do not rely on any deferred
close or stop operations; turn them into coroutines therefore, for best
readability.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Improve the readability of test_using_reusable_sst() by replacing
future::then() chaining with test_env::do_with_async() and future::get().
Unlike seastar::async(), test_env::do_with_async() restricts its input
lambda to returning "void". Because of this, introduce the variant
test_using_reusable_sst_returning(), based on
test_env::do_with_async_returning(), for lambdas returning non-void. Put
the latter to use in index_read() at once.
Subsequently, we'll gradually convert the lambdas passed to
test_using_reusable_sst() and test_using_reusable_sst_returning() from
returning futures to returning direct values. In order for
test_using_reusable_sst() and test_using_reusable_sst_returning() to cope
with both types of lambdas, wrap the lambdas into futurize_invoke().get().
In the seastar::thread context, future::get() will gracefully block on
genuine futures, and return immediately on direct values that were
futurized on the spot.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Make test_using_working_sst() easier to read by:
(1) replacing test_env::do_with() with seastar::async(),
seastar::defer(), and future::get();
(2) replacing seastar::async() and seastar::defer() with
test_env::do_with_async().
Technically speaking, this change does not perfectly preserve exceptional
behavior. Namely, test_env::do_with() uses future::finally() to link
test_env::stop() to the chain of futures, and future::finally() permits
test_env::stop() itself to throw an exception -- potentially leading to a
seastar::nested_exception being thrown, which would carry both the
original exception and the one thrown by test_env::stop().
Contrarily, the test_env::stop() deferred with seastar::defer() runs in a
destructor, and therefore test_env::stop() had better not throw there.
However, we will assume that test_env::stop() does not throw, albeit not
marked "noexcept". Prior commits 8d704f2532 ("sstable_test_env:
Coroutinize and move to .cc test_env::stop()", 2023-10-31) and
2c78b46c78 ("sstables::test_env: Carry compaction manager on board",
2023-10-31) show that we've considered individual actions in
test_env::stop() not to throw before.
The 128KB stack of seastar::thread (which underlies seastar::async())
should be a tolerable cost in a test case, in exchange for the improved
readability.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
repair_service::insert_repair_meta gets the reference to a table
and passes it to continuations. If the table is dropped in the meantime,
the reference becomes invalid.
Use find_column_family at each table occurrence in insert_repair_meta
instead.
Closesscylladb/scylladb#19953
in fabab2f4, we introduced preemption_source, and added
`SCYLLA_ENABLE_PREEMPTION_SOURCE` preprocessor macro to enable
opt-in the pluggable preemption check.
but CMake building system was not updated accordingly.
so, in this change, let's sync the CMake building system with
`configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19951
This reverts commit c3bea539b6.
Since it breaking offline-installer artifact-tests. Also, it seems that we should have merged it in the first place since we don't need scylla-housekeeping checks for offline-installer
Closesscylladb/scylladb#19976
compaction_manager::perform_compaction does not create task manager
task for compaction if parent_info is set to std::nullopt. Currently,
we always want to create task manager task for compaction.
Remove optional from task info parameters which start compaction.
Track all compactions with task manager.
If perform_compaction gets std::nullopt as a parent info then
the executor won't be tracked by task manager.
Modify storage_group::split call so that it passes empty task_info
instead of nullopt to track split.
With the recently added mv admission control, we can now
test how are the view update backlogs updated and propagated
without relying just on the response delays that it was causing
until now.
This patch adds a test for it, replicating issues scylladb/scylladb#18461
and scylladb/scylladb#18783.
In the test, we start with an empty view update backlog, then perform
a write to it, increasing its backlog and saving the updated backlog
on coordinator, the backlog then drops back to 0, we wait 1s for the
backlog to be gossiped and we perform another write which should
succeed.
Due to scylladb/scylladb#18461, the test would fail because
in both gossip rounds before and after the write, the backlog was empty,
causing the write to be blocked by admission control indefinitely.
Due to scylladb/scylladb#18783, the test would fail because when
the backlog drops back to 0 after the write, the change is never
registered, causing all writes to be blocked as well.
In this patch we add 2 tests for checking that the mv admission control works.
The first one simply checks whether, after increasing the backlog on one node
over the admission control threshold, the following request is rejected with
the error message corresponding to the admission control.
The second one checks whether, after triggering admission control, the entire
user request fails instead of just failing a replica write. This is done by
performing a number of writes, some of which trigger the admission control
and cause retries, then checking if the node that had a large view update backlog
received all the writes. Before, the writes would succeed on enough replicas,
reaching QUORUM, and allowing the user write to succeed and cause no retries,
even though on the replica with a high backlog the write got rejected due to
the backlog size.
To avoid an expensive stack unwind, instead of throwing an error,
we can just return it thanks to the boost::result type that the
affected methods use. The result with an exception needs to be
constructed not implicitly, but with boost::outcome_v2::failure,
because the exception, converted into coordinator_exception_container
can be then converted into both into a successful response_id_type
as well as into a failure.
Currently, when a replica's view update backlog is full, the write is still
sent by the coordinator to all replicas. Because of the backlog, the write
fails on the replica, causing inconsistency that needs to be fixed by repair.
To avoid these inconsistencies, this patch adds a check on the coordinator
for overloaded replicas. As a result, a write may be rejected before being
sent to any replicas and later retried by the user, when the replica is no
longer overloaded.
Fixesscylladb/scylladb#17426
Currently when a partition is deleted from the base table, we generate a
row tombstone update for each one of the view rows in the partition.
When the partition key in the view is the same as the base, maybe in a
different order, this can be done more efficiently - The whole corresponding
view partition can be deleted with one partition tombstone update.
With this commit, when generating view updates, if the update mutation has a
partition tombstone then for the views which have the same partition key
we will generate a partition tombstone update, and skip the individual
row tombstone updates.
Fixesscylladb/scylladb#8199Closesscylladb/scylladb#19338
* github.com:scylladb/scylladb:
mv: skip reading rows when generating partition tombstone update
mv: delete a partition in a single operation when applicable
cql-pytest: move ScyllaMetrics to util file to allow reuse
Add a test that drops a table while there is a counter update operation
ongoing in the table.
The test reproduces issue scylladb/scylla-enterprise#4475 and verifies
it is fixed.
Tablet load balancer tries to equalize tablet load between shards by
moving tablets. Currently, the tablet load balancer assumes that each
tablet has the same hotness. This may not be true, and some tables may
be hotter than others. If some nodes end up getting more tablets of
the hot table, we can end up with request load imbalance and reduced
performance.
In 79d0711c7e we implemented a
mitigation for the problem by randomly choosing the table whose tablet
replica should be moved. This should improve fairness of
movement. However, this proved to not be enough to get a good
distribution of tablets.
This change improves candidate selection to not relay on randomness
but rather evaluating candidates with respect to the impact on load
imbalance. Also, if there is no good candidate, we consider picking
other source shards, not the most-loaded one. This is helpful because
when finishing node drain we get just a few candidates per shard, all
of which may belong to a single table, and the destination may already
be overloaded with that table. Another shard may contain tablets of
another table which is not yet overloaded on the destination. And
shards may be of similar load, so it doesn't matter much which shard
we choose to unload.
We also consider other destinations, not the least-loaded one. This
helps when draining nodes and the source node has few shard
candidates. Shards on the destination may have similar load so there
is more than one good destinatin candidate. By limiting ourselves to a
single shard, we increase the chance that we're overload the table on
that shard.
The algorithm was evaluated using "scylla perf-load-balancing", which
simulates a sequeunce of 8 node bootstraps and decommissions for
different node and shard counts, RF, and tablet counts.
For example, for the following parameters:
params: {iterations=8, nodes=5, tablets1=128 (2.4/sh), tablets2=512 (9.6/sh), rf1=3, rf2=3, shards=32}
The results are:
Before:
Overcommit (old) : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit (old) : worst: {table1={shard=4.00 (best=1.25), node=1.81}, table2={shard=1.25 (best=1.04), node=1.11}}
Overcommit (old) : last : {table1={shard=2.50 (best=1.25), node=1.41}, table2={shard=1.25 (best=1.04), node=1.05}}
After:
Overcommit : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit : worst: {table1={shard=1.50 (best=1.25), node=1.02}, table2={shard=1.12 (best=1.04), node=1.01}}
Overcommit : last : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
So worst shard overcommit for table1 was reduced from 4 to 1.5. Overcommit
of 4 means that the most-loaded shard has 4 times more tablets than
the average per-shard load in the cluster.
Also, node overcommit for table1 was reduced from 1.81 to 1.02.
The magnitude of improvement depends greatly on test configurtion, so on topology and tablet distribution.
The algorithm is not perfect, it finds a local optimum. In the above
test, overcommit of 1.5 is not the best possible (1.25).
One of the reason why the current algorithm doesn't achieve best
distribution is that it works with a single movement at a time and
replication constraints limit the choice of destinations. Viable
destinations for remaining candidates may by only on nodes which are
not least-loaded, and we won't be able to fill the least loaded
node. Doing so would require more complex movement involving moving a
tablet from one of the destination nodes which doesn't have a replica
on the least loaded node and then replacing it with the candidate from
the source node.
Another limitation is that the algorithm can only fix balance by
moving tablets away from most loaded nodes, and it does so due to
imbalance between nodes. So it cannot fix the imbalance which is
already present on the nodes if there is not much to move due to
similar load between nodes. It is designed to not make the imbalance
worse, so it works good if we started in a good shape.
Fixes https://github.com/scylladb/scylladb/issues/16824Closesscylladb/scylladb#19779
* github.com:scylladb/scylladb:
test: perf: tablet_load_balancing: Test with higher shard and tablet counts
tablets: load_balancer: Avoid quadratic complexity when finding best candidate
tablets: load_balancer: Maintain load sketch properly during intra-node migration
tablets: load_balancer: Use "drained" flag
test: perf: tablet_load_balancing: Report load balancer stats
tablets: load_balancer: Move load_balancer_stats_manager to header file
tablets: load_balancer: Split evaluate_candidate() into src and dst part
tablets: load_balancer: Optimize evaluate_candidate()
tablets: load_balancer: Add more statistics
tablets: load_balancer: Track load per table on cluster level
tablets: load_balancer: Track load per table on node level
tablets: load_balancer: Use a single load sketch for tracking all nodes
locator: load_sketch: Introduce populate_dc()
tablets: load_balancer: Modify target load sketch only when emitting migration
locator: load_sketch: Introduce get_most_loaded_shard()
locator: load_sketch: Introduce get_least_loaded_shard()
locator: load_sketch: Optimize pick()/unload()
locator: load_sketch: Introduce load_type
test: perf: tablet_load_balancing: Report total tablet counts
test: perf: tablet_load_balancing: Print run parameters in the single simulation case too
test: perf: tablet_load_balancing: Report time it took to schedule migrations
tablets: load_balancer: Log table load stats after each migration
tablets: load_balancer: Log per-shard load distribution in debug level
tablets: load_balancer: Improve per-table balance
tablets: load_balancer: Extract check_convergence()
tablets: load_balancer: Extract nodes_by_load_cmp
tablets: load_balancer: Maintain tablet count per table
tablets: load_balancer: Reuse src_node_info
test: perf: tablet_load_balancing: Print warnings about bad overcommit
test: perf: tablet_load_balancing: Allow running a single simulation
test: perf: tablet_load_balancing: Report best possible shard overcommit
test: perf: tablet_load_balancing: Report global shard overcommit
This commit adds the 6.0-to-6.1 upgrade guide.
Compared to the previous upgrade guide:
- Added the "Ensure Consistent Topology Changes Are Enabled" prerequisite.
- Removed the "After Upgrading Every Node" section. Both Raft-based schema changes and topology updates
are mandatory in 6.1 and don't require any user action after upgrading to 6.1.
- Removed the "Validate Raft Setup" section. Raft was enabled in all 6.0 clusters (for schema management),
so now there's no scenario that would require the user to follow the validation procedure.
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required
to execute this req,
2. global topo req is executed by topo coordinator, which reads data
attached to the req.
The KS name is among the data attached to the req.
There's a time window between these steps where a to-be-altered KS could
have been DROPped, which results in topo coordinator forever trying to
ALTER a non-existing KS. In order to avoid it, the code has been changed
to first check if a to-be-altered KS exists, and if it's not the case,
it doesn't perform any schema/tablets mutations, but just removes the
global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected
changes, which is due to the fact that the code is written badly and
needs to be refactored - an effort that's already planned under #19126Fixes: #19576
It's only needed to start hints via proxy, but proxy can do it without gossiper argument
Closesscylladb/scylladb#19894
* github.com:scylladb/scylladb:
storage_service: Remote gossiper argument from join_cluster()
proxy: Use remote gossiper to start hints resource manager
hints: Const-ify gossiper references and anchor pointers
When a table is dropped it should wait for all pending operations in the
table before the table is destroyed, because the operations may use the
table's resources.
With counter update operations, currently this is not the case. The
table may be destroyed while there is a counter update operation in
progress, causing an assert to be triggered due to a resource being
destroyed while it's in use.
The reason the operation is not waited for is a mistake in the lifetime
management of the object representing the write in progress. The commit
fixes it so the object lives for the duration of the entire counter
update operation, by moving it to the `do_with` list.
Fixesscylladb/scylla-enterprise#4475Closesscylladb/scylladb#19948
A user complained that ScyllaDB is incompatible with Cassandra when it
requires ALLOW FILTERING on a restriction like WHERE x=1 AND y=1 where
x and y are two columns with secondary indexes.
In the tests added in this patch we show that:
1. Scylla *is* compatible with Cassandra when the traditional "CREATE
INDEX" is used - ALLOW FILTERING *is* required in this case in both
Cassandra and Scylla.
2. If SAI is used in Cassandra (CREATE CUSTOM INDEX USING 'SAI'),
indeed ALLOW FILTERING becomes optional. I believe this is incorrect
so I opened CASSANDRA-19795.
These two tests combined show that we're not incompatible with Cassandra,
rather Cassandra's two index implementations are incompatible between
themselves, and Scylla is in fact compatible in this case with Cassadra's
traditional index and not with SAI.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19909
If the source and destination shards picked for migration based on
global tablet balance do not have a good candidate in terms of effect
on per-table balance, the algorithm explores other source shards and
destinations. This has quadratic complexity in terms of shard count in
the worst case, when there are no good candidates.
Since we can have up to ~200 shards, this can slow down scheduling
significantly. I saw total scheduling time of 5 min in the following run:
scylla perf-load-balancing -c1 -m1G --iterations=8 \
--nodes=4 --tablets1=1024 --tablets2=8096 \
--rf1=2 --rf2=3 --shards=256
To improve, change the apprach to first find the best source shard and
then best target shard, sequentially. So it's now linear in terms of
shard count.
After the change, the total scheduling time in that run is down to 4s.
Minimizing source and destination metrics piece-wise minimizes the
combined metric, so badness of the best candidate doesn't suffer after
this change.
Affects only intra-node migration. The code was recording destination
shard as taken and did not un-take it in case we skipped the migration
due to lack of candidates.
Noticed during code review. Impact is minor, since even if this leads
to suboptimal balance, the next scheduling round should fix it.
Also, the source shard was not unloaded, but that should have no
impact on decisions. But to be future-proof, better to maintain the
load accurately in case the algorithm is extended with more steps.
Some of the calls inside the `raft_group0_client::start_operation()` method were missing the abort source parameter. This caused the repair test to be stuck in the shutdown phase - the abort source has been triggered, but the operations were not checking it.
This was in particular the case of operations that try to take the ownership of the raft group semaphore (`get_units(semaphore)`) - these waits should be cancelled when the abort source is triggered.
This should fix the following tests that were failing in some percentage of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3
Fixesscylladb/scylladb#19223Closesscylladb/scylladb#19860
* github.com:scylladb/scylladb:
raft: fix the shutdown phase being stuck
raft: use the abort source reference in raft group0 client interface
There's a counter and a shared future on board, that used to facilitate
start-time barrier synchronization. Now they are not needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When starting, view builder spawns an async background fibers, and upon
its completion each shard needs to wait for other shards to do the same.
This is exactly what cross-shard barrier is about, so instead of
synchronizing via v.b.'s shard-0 instance, use the barrier. This makes
the view_builder::start() shorder and earier to read.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The barrier will be used by next patch to synchronize shards with each
other. When passed to invoke_on_all() lambda like this, each lambda gets
its its copy of the barrier "handler" that maintains shared state across
shards.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
They are executed frequently during tablet scheduling. Currently, they
have time complexity of O(N*log(N)) in terms of shard count. With
large shard counts, that has significant overhead.
This patch optimizes them down to O(log(N)).
Tablet load balancer tries to equalize tablet load between shards by
moving tablets. Currently, the tablet load balancer assumes that each
tablet has the same hotness. This may not be true, and some tables may
be hotter than others. If some nodes end up getting more tablets of
the hot table, we can end up with request load imbalance and reduced
performance.
In 79d0711c7e we implemented a
mitigation for the problem by randomly choosing the table whose tablet
replica should be moved. This should improve fairness of
movement. However, this proved to not be enough to get a good
distribution of tablets.
This change improves candidate selection to not relay on randomness
but rather evaluating candidates with respect to the impact on load
imbalance. Also, if there is no good candidate, we consider picking
other source shards, not the most-loaded one. This is helpful because
when finishing node drain we get just a few candidates per shard, all
of which may belong to a single table, and the destination may already
be overloaded with that table. Another shard may contain tablets of
another table which is not yet overloaded on the destination. And
shards may be of similar load, so it doesn't matter much which shard
we choose to unload.
We also consider other destinations, not the least-loaded one. This
helps when draining nodes and the source node has few shard
candidates. Shards on the destination may have similar load so there
is more than one good destinatin candidate. By limiting ourselves to a
single shard, we increase the chance that we're overload the table on
that shard.
The algorithm was evaluated using "scylla perf-load-balancing", which
simulates a sequeunce of 8 node bootstraps and decommissions for
different node and shard counts, RF, and tablet counts.
For example, for the following parameters:
params: {iterations=8, nodes=5, tablets1=128 (2.4/sh), tablets2=512 (9.6/sh), rf1=3, rf2=3, shards=32}
The results are:
After:
Overcommit : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit : worst: {table1={shard=1.50 (best=1.25), node=1.02}, table2={shard=1.12 (best=1.04), node=1.01}}
Overcommit : last : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Before:
Overcommit (old) : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit (old) : worst: {table1={shard=4.00 (best=1.25), node=1.81}, table2={shard=1.25 (best=1.04), node=1.11}}
Overcommit (old) : last : {table1={shard=2.50 (best=1.25), node=1.41}, table2={shard=1.25 (best=1.04), node=1.05}}
So shard overcommit for table1 was reduced from 4 to 1.5. Overcommit
of 4 means that the most-loaded shard has 4 times more tablets than
the average per-shard load in the cluster.
Also, node overcommit for table1 was reduced from 1.81 to 1.02.
The magnitude of improvement depends greatly on test configurtion, so on topology and tablet distribution.
The algorithm is not perfect, it finds a local optimum. In the above
test, overcommit of 1.5 is not the best possible (1.25).
One of the reason why the current algorithm doesn't achieve best
distribution is that it works with a single movement at a time and
replication constraints limit the choice of destinations. Viable
destinations for remaining candidates may by only on nodes which are
not least-loaded, and we won't be able to fill the least loaded
node. Doing so would require more complex movement involving moving a
tablet from one of the destination nodes which doesn't have a replica
on the least loaded node and then replacing it with the candidate from
the source node.
Another limitation is that the algorithm can only fix balance by
moving tablets away from most loaded nodes, and it does so due to
imbalance between nodes. So it cannot fix the imbalance which is
already present on the nodes if there is not much to move due to
similar load between nodes. It is designed to not make the imbalance
worse, so it works good if we started in a good shape.
Fixes#16824
Will be reused when evaluating different targets for migration in later
stages.
The refactoring drops updating of _stats.for_dc(dc).stop_no_candidates
and we update _stats.for_dc(dc).stop_load_inversion in both cases
where convergence check may fail. The reason is that stat updates must
be outside check_convergence(), since the new use case should not
update those stats (it doesn't stop balancing, just drops
candidates). Propagating the information for distinguishing the two
cases would be a burden. But it's not necessary, since both cases are
actually load inversion cases, one pre-migration the other
post-migration, so we don't need the distinction.
It's actually wrong to increment stop_no_candidates, since there may
still be candidates, it's the load which is inverted.
Rather than maximum per-node shard overcommit. Global shard overcommit
is a better metric since we want to equalize global load not just
per-node load.
Some of the calls inside the `raft_group0_client::start_operation()`
method were missing the abort source parameter. This caused the repair
test to be stuck in the shutdown phase - the abort source has been
triggered, but the operations were not checking it.
This was in particular the case of operations that try to take the
ownership of the raft group semaphore (`get_units(semaphore)`) - these
waits should be cancelled when the abort source is triggered.
This should fix the following tests that were failing in some percentage
of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3
Fixesscylladb/scylladb#19223
Most callers of the raft group0 client interface are passing a real
source instance, so we can use the abort source reference in the client
interface. This change makes the code simpler and more consistent.
`maybe_rehash` is complimentary and is not strictly required to succeed.
If it fails, it will retry on the next call, but there's no reason
to throw a bad_alloc exception that will fail its caller, since `maybe_rehash`
is called as the final step after the caller has already succeeded with
its action.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We want to call `service_level_controller::do_abort()` in all cases.
The current code (introduced in
535e5f4ae7)
calls do_abort if abort was not requested, however, since
it does so by checking the subscription bool operator,
it would miss the case where abort was already requested
before the subscription took place (in service_level_controller
ctor).
With scylladb/seastar@470b539b1c and
scylladb/seastar@8ecce18c51
we can just unconditionally call the subscription `on_abort`
method, that ensures only-once semantics, even if abort
was already requested at subscription time.
Fixesscylladb/scylladb#19075
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#19929
mapreduce_server was previously coroutinized, but only partially. This
series completes coroutinization and eliminates remaining continuation chains.
None of this code is performance sensitive as it runs at the super-coordinator level
and is amortized over a full scan of the entire table.
No backport needed as this is a cleanup.
Closesscylladb/scylladb#19913
* github.com:scylladb/scylladb:
mapreduce_service: reindent
mapreduce_service: coroutinize retrying_dispatcher::dispatch_to_node()
mapreduce_service: coroutinize dispatch() inner lambda
The Alternator command ListTables is supposed to list actual tables
created with CreateTable, and should list things like materialized views
(created for GSI or LSI) or CDC log tables.
We already properly excluded materialized views from the list - and
had the tests to prove it - but forgot both the exclusion and the testing
for CDC log tables - so creating a table xyz with streams enable would
cause ListTables to also list "xyz_scylla_cdc_log".
This patch fixes both oversights: It adds the code to exclude CDC logs
from the output of ListTables, add adds a test which reproduces the bug
before this fix, and verifies the fix works.
Fixes#19911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19914
In commit bac7c33313 we introduced a new
test for the Alternator "/localnodes" request, checking that a node
that is still joining does not get returned. The tests used what I
thought were "very high" timeouts - we had a timeout of 10 seconds
for starting a single node, and injected a 20 second sleep to leave
us 10 seconds after the first sleep.
But the test failed in one extremely slow run (a debug build on
aarch64), where starting just a single node took more than 15 seconds!
So in this patch I increase the timeouts significantly: We increase
the wait for the node to 60 seconds, and the sleeping injection to
120 seconds. These should definitely be enough for anyone (famous
last words...).
The test doesn't actually wait for these timeouts, so the ridiculously
high timeouts shouldn't affect the normal runtime of this test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19916
There's a test in object_store suite that verifies the contents of a bucket. It does with the plain http request, but unfortunately this doesn't work -- even local minio uses restricted bucket and using plain http request results in 403(Forbidden) error code. Test doesn't check it and continues working with empty list of objects which, in turn, is what it expects to see.
The fix is in using boto3. With it, the acc/secret pair is picked up and listing the bucket finally works.
Closesscylladb/scylladb#19889
* github.com:scylladb/scylladb:
test/object_store: Use boto3.resource to list bucket
test/object_store: Add get_s3_resource() helper
Instead of plain http request, use the power of boto3 package. The
recently added get_s3_resource() facilitates creating one
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It creates boto3.resource object that points to endpoint maintained
by s3_server argument (that tests obtain via fixture). This allows
using boto3 to access S3 bucket from local minio server.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
instead of using fmt::runtime, use compile-time format string in
order to detect the bad format string, or missing format arguments,
or arguments which are not formattable at compile time.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19901
* seastar 67065040...a7d81328 (30):
> reactor: Initialize _aio_pollfd later
> abortable_fifo: fix a typo in comment
> net: Expose DNS error category
> pollable_fd_state: use default-generated dtor
> perftune: tune tcp_mem
> scripts/perftune.py: clock source tweaking: special case Amazon and Google KVM virtualizations
> abort_source: subscription: keep callback function alive after abort
> github: disable ccache when building with C++ modules
> github: add enable-ccache input to test.yaml
> pollable_fd_state: Mark destructor protected and make non-virtual
> reactor: Mark .configure() private
> reactor: Set aio_nowait_supported once
> reactor: Add .no_poll_aio to reactor_config
> reactor: Move .max_poll_time on reactor_config
> reactor: Move .task_quota on reactor_config
> reactor: Move .strict_o_direct on reactor_config
> reactor: Move .bypass_fsync on reactor_config
> reactor: Move .max_task_backlog on reactor_config
> reactor: Move .force_io_getevents_syscall on reactor_config
> reactor: Move .have_aio_fsync on reactor_config
> reactor: Move .kernel_page_cache on reactor_config
> reactor: Move .handle_sigint on reactor_config
> reactor_backend: Construct _polling_io from reactor config
> reactor: Move config when constructing
> reactor: Use designated initializers to set up reactor_config
> native-stack: use queue::pop_eventually() in listener::accept()
> abort_source: subscription: allow calling on_abort explicitly
> file: document that close() returns the file object to uninitialized state
> code-cleanup: do not include 'smp.hh' in 'reactor.hh'
> code-cleanup: remove redundant includes of smp.hh
Closesscylladb/scylladb#19912
the parameter names do not match with the ones we are using.
these comments were inherited from Origin, but we failed to update
them accordingly.
in this change, the comments are updated to reflect the function
signatures.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19900
The build_unified.sh script accepts a --build-dir option, which
specifies the directory used for storing temporary files extracted
from tarballs defined by the --pkgs option. When performing parallel
builds of multiple modes, it's crucial that each build uses a unique
build directory. Reusing the same build directory for different modes
can lead to conflicts, resulting in build failures or, more seriously,
the creation of tarballs containing corrupted files.
so, in this change, we specify a different directory for each mode,
so that they don't share the same one.
Refs scylladb/scylladb#2717
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19905
Simplify the function by converting it to a coroutine.
Note that while the final co_return co_await looks like a loop (and
therefore an await would introduce an O(n) allocation), it really isn't -
we retry at most once.
Currently, delete_atomically can be called with
a list of sstables from mixed prefixes in two cases:
1. truncate: where we delete all the sstables in the table directory
2. tablet cleanup: similar to truncate but restricted to sstables in a
single tablet replica
In both cases, it is possible that sstables in staging (or quarantine)
are mixed with sstables in the base directory.
Until a more comprehensive fix is in place,
(see https://github.com/scylladb/scylladb/pull/19555)
this change just lifts the ban on atomic deletion
of sstables from different prefixes, and acknowledging
that the implementation is not atomic across
prefixes. This is better than crashing for now,
and can be backported more easily to branches
that support tablets so tablet migration can
be done safely in the presence of repair of
tables with views.
Refs scylladb/scylladb#18862
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#19816
This pointer was only needed to pull all the way down the hints resource
manager start() method. It's no longer needed for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
By the time hinst resource manager is started, proxy already has its
remote part initialized. Remote returns const gossiper pointer, but
after previous change hints code can live with it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two places in hints code that need gossiper: hist_sender
calling gossiper::is_alive() and endpoint_downtime_not_bigger_than()
helper in manager. Both can live with const gossiper, so the dependency
references and anchor pointers can be restricted to const too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The testcase `test_bloom_filter_reclaim_during_reload` checks the
SSTable manager's `_total_memory_reclaimed` against an expected value to
verify that a Bloom filter was reloaded. However, it does not wait for
the manager to update the variable, causing the check to fail if the
update has not occurred yet. Fix it by making the testcase wait until
the variable is updated to the expected value.
Fixes#19879
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#19883
When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address.
This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0.
As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas.
In addition to the fix, this PR also includes a regression test heavily based on the test that @kbr-scylla prepared during his investigation of the issue.
Fixes: scylladb/scylladb#19439
This issue can cause multiple nodes to crash at once and the fix is quite small, so I think this justifies backporting it to all affected versions. 6.0 and 6.1 are affected. No need to backport to 5.4 as this issue only happens with tablets, and tablets are experimental there.
Closesscylladb/scylladb#19765
* github.com:scylladb/scylladb:
test: regression test for MV crash with tablets during decommission
db/view: drop view updates to replaced node marked as left
when deleting a base partition, in some cases we can update the view by
generating a single partition deletion update, instead of generating a
row deletion update for each of the partition rows.
If this is the case for all the affected views, and there are no other
updates besides deleting the partition, then we can skip reading and
iterating over all the rows, since this won't generate any additional
updates that are not covered already.
Currently when a partition is deleted from the base table, we generate a
row tombstone update for each one of the view rows in the partition.
When the partition key in the view is the same as the base, maybe in a
different order, this can be done more efficiently - The whole corresponding
view partition can be deleted with one partition tombstone update.
With this commit, when generating view updates, if the update mutation has a
partition tombstone then for the views which have the same partition key
we will generate a partition tombstone update, and skip the individual
row tombstone updates.
Fixesscylladb/scylladb#8199
ScyllaMetrics is a useful generic component for retrieving metrics in a
pytest.
The commit moves the implementation from test_shedding.py to util.py to
make it reusable in other tests in cql-pytest.
There are few api::set_foo()-s left in main that are placed in ~~random~~ legacy order. This PR fixes it and makes few more associated cleanups.
refs: #2737Closesscylladb/scylladb#19682
* github.com:scylladb/scylladb:
api: Unset cache_service endpoints on stop
main: Don't ignore set_cache_service() future
api: Move storage API few steps above
api: Register token-metadata API next to token-metadata itsels
api: Do not return zero local host-id
api: Move snitch API registration next to snitch itself
They currently stay registered long after the dependent services get
stopped. There's a need for batch unsetting (scylladb/seastar#1620), so
currently only this explicit listing :(
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The call itself seem to be in wrong place -- there's no "cache service"
also the API uses database and snapshot_ctl to work on. So it deserves
more cleanup, but at least don't throw the returned future<> away.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sequence currently is
sharded<storage_service>.start()
sharded<query_processor>.invoke_on_all(start_remote)
api::set_server_storage_service()
The last two steps can be safely swapped to keep storage service API
next to its service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now API registration happens quite late because it waits storage
service to register its "function" first. This can be done beforeheand
and the t.m. API can be moved to where it should be.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The local host id is read from local token metadata and returned to the
caller as string. The t.m. itself starts with default-constructed host
id vlaue which is updated later. However, even such "unset" host id
value can be rendered as string without errors. This makes the correct
work of the API endpoint depend on the initialization sequence which may
(spoilter: it will) change in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's finally no longer used. Now only sstables storage code "knows" that
keyspace may have its on-disk directory.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the init_keyspace_storage() expects that the caller would
tell it where the ks directory is, but it's not nice as keyspace may
not necessarity keep its sstables in any directory.
This patch moves the directory path evaluation into storage code,
specifically to the lambda that is called for on-disk sstables. The
way directory is evaluated mirrors the one from make_keyspace_config()
that will be removed by next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The admission test has a section which tests admission when the
semaphore has inactive reads. This section (and therefore the enire
test) became flaky lately, after a seemingly unrelated seastar upgrade,
which improved timers.
The cause of the flakyness is the permit which is made inactive later:
this permit is created with 0 timeout (times out immediately). For some
time now, when the timeout timer of a permit fires, if the permit is
inactive, it is evicted. This is what makes the test fail: the inactive
read times out and ends up evicting this permit, which is not expected
for the test. The reason this was not a problem before, is that the test
finishes very quickly, usually, before the timer could even be polled by
the reactor. The recent seastar changes changed this and now the timer
sometimes get polled and fires, failing the test.
Fixes: #19801Closesscylladb/scylladb#19859
Alternator allows authentication into the existing CQL roles, but
roles which have the flag "login=false" should be refused in
authentication, and this patch adds the missing check.
The patch also adds a regression test for this feature in the
test/alternator test framework, in a new test file
test/alternator/cql_rbac.py. This test file will later include more
tests of how the CQL RBAC commands (CREATE ROLE, GRANT, REVOKE)
affect authentication and authorization in Alternator.
In particular, these tests need to use not just the DynamoDB API but
also CQL, so this new test file includes the "cql" fixture that allows
us to run CQL commands, to create roles, to retrieve their secret keys,
and so on.
Fixesscylladb/scylladb#19735Closesscylladb/scylladb#19740
Introduce virtual tasks - task manager tasks which cover
cluster-wide operations.
Virtual tasks aren't kept in memory, instead their statuses
are retrieved from associated service when user requests
them with task manager API. From API users' perspective,
virtual tasks behave similarly to regular tasks, but they can
be queried from any node in a cluster.
Virtual tasks cannot have a parent task. They can have
children on each node in a cluster, but do not keep references
to them. So, if a direct child of a virtual task is unregistered
from task manager, it will no longer be shown in parent's
children vector.
virtual_task class corresponds to all virtual tasks in one
group. If users want to list all tasks in a module, a virtual_task
returns all recent supported operations; if they request virtual
task's status - info about the one specified operation is
presented. Time to live, number of tracked operations etc.
depend on the implementation of individual virtual_task.
All virtual_tasks are kept only on shard 0.
Refs: https://github.com/scylladb/scylladb/issues/15852
New feature, no backport needed.
Closesscylladb/scylladb#16374
* github.com:scylladb/scylladb:
docs: describe virtual tasks
db: node_ops: filter topology request entries
test: add a topology suite for testing tasks
node_ops: service: create streaming tasks
node_ops: register node_ops_virtual_task in task manager
service: node_ops: keep node ops module in storage service
node_ops: implement node_ops_virtual_task methods
db: service: modify methods to get topology_requests data
db: service: add request type column to topology_requests
node_ops: add task manager module and node_ops_virtual_task
tasks: api: add virtual task support to get_task_status_recursively
tasks: api: add virtual task support
tasks: api: add virtual tasks support to get_tasks
tasks: add task_handler to hide task and virtual_task differences from user
tasks: modify invoke_on_task
tasks: implement task_manager::virtual_task::impl::get_children
tasks: keep virtual tasks in task manager
tasks: introduce task_manager::virtual_task
as these workflows are scheduled periodically, and if they fail, notifications are sent to the repo's owner. to minimize the surprises to the contributors using github, let's disable these workflows on fork repos.
Closesscylladb/scylladb#19736
* github.com:scylladb/scylladb:
github: do not run clang-tidy as a cron job
github: disable scheduled workflow on forks
We modify `ScyllaCluster.server_start` so that it changes seeds of the
starting node to all currently running nodes. This allows writing tests like
```python
s1 = await manager.server_add(start=False)
await manager.server_add()
await manager.server_start(s1.server_id)
```
However, it disallows writing tests that start multiple clusters. To fix this,
we add the `seeds` parameter to `server_start`.
We also improve the logic in `ScyllaCluster.add_server` to allow writing
tests like
```python
await manager.server_add(expected_error="...")
await manager.server_add()
```
This PR only adds improvements to the `test.py` framework, no need
to backport it.
Closesscylladb/scylladb#19847
* github.com:scylladb/scylladb:
test: scylla_cluster: improve expected_error in add_server
test: scylla_cluster: support more test scenarios
test: scylla_cluster: correctly change seeds in server_start
We make two changes:
- we lease the IP address of a node that failed to boot because of
an expected error,
- we don't log "Cluster ... added ..." when a node fails to boot
because of an expected error.
Here are some examples of tests that don't work with no initial
nodes, but they should work:
1.
```
await manager.server_add(expected_error="...")
await manager.server_add()
```
2.
```
await manager.servers_add(2, expected_error="...")
await manager.servers_add(2)
```
3.
```
s1 = await manager.server_add(start=False)
await manager.server_start(s1.server_id, expected_error="...")
await manager.server_add()
```
4.
```
[s1, s2] = await manager.servers_add(2, start=False)
await manager.server_start(s1.server_id, expected_error="...")
await manager.server_start(s2.server_id, expected_error="...")
await manager.servers_add(2)
```
5.
```
s1 = await manager.server_add(start=False)
await manager.server_add()
await manager.server_start(s1.server_id)
```
6.
```
[s1, s2] = await manager.servers_add(2, start=False)
await manager.servers_add(2)
await manager.server_start(s1.server_id)
await manager.server_start(s2.server_id)
```
In this patch, we make a few improvements to make tests like the ones
presented above work. I tested all the examples above manually.
From now on, servers receive correct seeds if the first servers added
in the test didn't start or failed to boot.
Also, we remove the assertion preventing the creation of a second
cluster. This assertion failed the tests presented above. We could
weaken it to make these tests pass, but it would require some work.
Moreover, we have tests that intentionally create two clusters.
Therefore, we go for the easiest solution and accept that a single
`ScyllaCluster` may not correspond to a single Scylla cluster.
We change seeds in `ScyllaCluster.server_start` to all currently
running nodes. The previous code only pretended that it did it.
After doing this change, writing tests that create multiple clusters
is impossible. To allow it, we add the `seeds` parameter to
`ManagerClient.server_start`. We use it to fix and simplify the only
test that creates two clusters - `test_different_group0_ids`.
system_keyspace::get_topology_request_entries returns entries for
requests which are running or have finished after specified time.
In task manager node ops task set the time so that they are shown
for task_ttl seconds after they have finished.
Add topology_tasks test suite for testing task manager's node ops
tasks. Add TaskManagerClient to topology_tasks for an easy usage
of task manager rest api.
Write a test for bootstrap, replace, rebuild, decommission and remove
top level tasks using the above.
Keep task manager node ops module in storage service. It will be
used to create and manage tasks related to topology changes.
The module is created and registered in storage service constructor.
In storage_service::stop() the module is stopped and so all the remaining
tasks would be unregistered immediately after they are finished.
Modify get_topology_request_state (and wait_for_topology_request_completion),
so that it doesn't call on_internal_error when request_id isn't
in the topology_requests table if require_entry == false.
Add other methods to get topology request entry.
topology_requests table will be used by task manager node ops tasks,
but it loses info about request type, which is required by tasks.
Add request_type column to topology_requests.
Virtual tasks are supported by get_task_status, abort_task and
wait_task.
Task status returned by get_task_status and wait_task:
- contains task_kind to indicate whether it's virtual (cluster) or
regular (node) task;
- children list apart from task_id contains node address of the task.
task_manager/list_module_tasks/{module} starts supporting virtual tasks,
which means that their stats will also be shown for users.
Additional task_kind param is added to indicate whether the task is
virutal (cluster-wide) or regular (node-wide).
Support in other paths will be added in following patches.
Contrary to regular tasks, which are per-operation, virtual tasks
are associated with the whole group of operations. There may be many
operations of each group performed at the same time. Info about each
running operation will be shown to a user through the API.
For virtual tasks, task manager imitates a regular task covering
each operation, but task_manager::tasks aren't actually created
in the memory. Instead, information (e.g. status) about the operation
is retrieved from associated service and passed to a user.
To hide most of the differences from user, task_handler class is created.
Task handler performs appropriate actions depending on task's kind.
However, users need to stay conscious about the kind of task, because:
- get_task_status and wait_task do not unregister virtual tasks;
- time for which a virtual tasks stays in task manager depends
on associated service and tasks' implementation;
- number of virtual task's children shown by get_tasks doesn't have
to be monotonous.
API is modified to use task_handler.
API-specific classes are moved to task_handler.{cc,hh}.
Virtual tasks are kept in task manager together with regular tasks.
All virtual tasks are stored on shard 0.
task_manager::module::make_task is modified to consider virtual
tasks as possible parents.
A virtual task is a new kind of task supported by task manager,
which covers cluster-wide operations.
From users' perspective virtual tasks behave similarly
to task_manager::tasks. The API side of virtual tasks will be
covered in the following patches.
Contrary to task_manager::task, virtual task does not update
its fields proactively. Moreover, no object is kept in memory
for each individual virtual task's operation. Instead a service
(or services) is queried on API user's demand to learn about
the status of running operation. Hence the name.
task_manager::virtual_task is responsible for a whole group
of virtual tasks, i.e. for tracking and generating statuses
of all operations of similar type.
To enable tracking of some kind of operations, one needs to
override task_manager::virtual_task::impl and provide implementations
of the methods returning appropriate information about the operations.
task_manager::virtual_task must be kept on shard 0.
Similarly to task_manager::tasks, virtual tasks can have child tasks,
responsible for tracking suboperations' progress. But virtual tasks
cannot have parents - they are always roots in task trees.
Some methods and structs will be implemented in later patches.
Alternator's "/localnodes" HTTP request is supposed to return the list of
nodes in the local DC to which the user can send requests.
The existing implementation incorrectly used gossiper::is_alive() to check
for which nodes to return - but "alive" nodes include nodes which are still
joining the cluster and not really usable. These nodes can remain in the
JOINING state for a long time while they are copying data, and an attempt
to send requests to them will fail.
The fix for this bug is trivial: change the call to is_alive() to a call
to is_normal().
But the hard part of this test is the testing:
1. An existing multi-node test for "/localnodes" assummed that right after
a new node was created, it appears on "/localnodes". But after this
patch, it may take a bit more time for the bootstrapping to complete
and the new node to appear in /localnodes - so I had to add a retry loop.
2. I added a test that reproduces the bug fixed here, and verifies its
fix. The test is in the multi-node topology framework. It adds an
injection which delays the bootstrap, which leaves a new node in JOINING
state for a long time. The test then verifies that the new node is
alive (as checked by the REST API), but is not returned by "/localnodes".
3. The new injection for delaying the bootstrap is unfortunately not
very pretty - I had to do it in three places because we have several
code paths of how bootstrap works without repair, with repair, without
Raft and with Raft - and I wanted to delay all of them.
Fixes#19694.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19725
this member function prepares for the backup feature, where the
object to be stored in the object storage is already persisted as a
file on local filesystem. this brings us two benefits:
- with the file, we don't need to accumulate the payloads in memory
and send them in batch, as we do in upload_sink and in
upload_jumbo_sink. this puts less pressure on the memory subsystem.
- with the file, we can read multiple parts in parallel if multpart
upload applies to it, this helps to improve the throughput.
so, this new helper is introduced to help upload an sstable from local
filesystem to the object storage.
Fixes https://github.com/scylladb/scylladb/issues/16287Closesscylladb/scylladb#16387
* github.com:scylladb/scylladb:
s3/client: add client::upload_file()
s3/client: move constants related to aws constraints out
this member function prepares for the backup feature, where the
object to be stored in the object storage is already persisted as a
file on local filesystem. this brings us two benefits:
- with the file, we don't need to accumulate the payloads in memory
and send them in batch, as we do in upload_sink and in
upload_jumbo_sink. this puts less pressure on the memory subsystem.
- with the file, we can read multiple parts in parallel if multpart
upload applies to it, this helps to improve the throughput.
so, this new helper is introduced to help upload an sstable from local
filesystem to the object storage.
Fixes#16287
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
minimum_part_size and aws_maximum_parts_in_piece are
AWS S3 related constraints, they can be reused out of
client::upload_sink and client::upload_jumbo_sink, so
in this change
* extract them out.
* use the user-defined literal with IEC prefix for
better readablity to define minimum_part_size
* add "aws_" prefix to `minimum_part_size` to be
more consistent.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After c1b2b8cb2c /task_manager/wait_task/
does not unregister tasks anymore.
Delete the check if the task was unregistered from test_task_manager_wait.
Check task status in drain_module_tasks to ensure that the task
is removed from task manager.
Fixes: #19351.
Closesscylladb/scylladb#19834
When you SELECT a boolean from system.config, it reads as true/false, but this isn't accepted
on UPDATE (instead, we accept 1/0). This is surprising and annoying, so accept true/false in
both directions.
Not a regression, so a backport isn't strictly necessary.
Closesscylladb/scylladb#19792
* github.com:scylladb/scylladb:
config: specialize from-string conversion for bool
config: wrap boost::lexical_cast<> when converting from strings
If set, any remaining segment that has data older than this threshold will request flushing, regardless of data pressure. I.e. even a system where nothing happends will after X seconds flush data to free up the commit log.
Related to #15820
The functionality here is to prevent pathological/test cases where a silent system cannot fully process stuff like compaction, GC etc due to things like CL forcing smaller GC windows etc.
Closesscylladb/scylladb#15971
* github.com:scylladb/scylladb:
commitlog: Make max data lifetime runtime-configurable
db::config: Expose commitlog_max_data_lifetime_in_s parameter
commitlog: Add optional max lifetime parameter to cl instance
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.
The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.
Fixes https://github.com/scylladb/scylladb/issues/19722Closesscylladb/scylladb#19717
* github.com:scylladb/scylladb:
boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
sstables: do not reload components of unlinked sstables
sstables/sstables_manager: introduce on_unlink method
`filesystem_storage` methods frequently call `sync_directory()`, for the sake of flushing (sync'ing) a directory. `sync_directory()` always brackets the sync with open and close, and given that most `sync_directory()` calls target the sstable base directory, those repeated opens and closes are considered wasteful. Rework the `filesystem_storage::_dir` member (from a mere pathname) so that it stand for an `opened_directory` object, which keeps the sstable base directory open, for the purpose of repeated sync'ing.
Resolves#2399.
Closesscylladb/scylladb#19624
* github.com:scylladb/scylladb:
sstables/storage: synch "dst_dir" more leanly in create_links_common()
sstables/storage: close previous directory asynchronously upon dir change
sstables/storage: futurize change_dir_for_test()
sstables/storage: sync through "opened_directory" in filesystem...::move()
sstables/storage: sync through "opened_directory" in the "easy" cases
sstables/storage: introduce "opened_directory" class
Current upgrade dtest rely on a ccm node function to
get_highest_supported_sstable_version() that looks for
r'Feature (.*)_SSTABLE_FORMAT is enabled' in the log files.
Starting from scylla-6.0 ME_SSTABLE_FORMAT is enabled by default
and there is no cluster feature for it. Thus get_highest_supported_sstable_version()
returns an empty list resulting in the upgrade tests failures.
This change introduces a seperate API path that returns the highest
supported sstable format (one of la, mc, md, me) by a scylla node.
Fixesscylladb/scylladb#19772
Backports to 6.0 and 6.1 required. The current upgrade test in dtest
checks scylla upgrades up to version 5.4 only. This patch is a
prerequisite to backport the upgrade tests fix in dtest.
Closesscylladb/scylladb#19787
We already have code to return min() for
the minimum and maximum tokens in long_token()
and raw(), so instead of using code to return
it, just make sure to set it in the _data member.
Note that although this change affect serialization,
the existing codebase ignores the deserialized bytes
and places a constant (0 before this patch, or min()
with it) in _data for non-key (minumum or maximum) tokens.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Users outside of the token module don't
need to mess with the token::kind.
They can only create key tokens.
Never, minimum or maximum tokens, with a particular
datya value.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
sizeof(dht::token) is only 16 bytes and therefore
it can be passed with 2 registers.
There is no sense in defining minimum_token
and maximum_token out of line, returning a token&
to statically allocated values that require memory
access/copy, while the only call sites that needs
to point to the static min/max tokens are in
dht::ring_position_view.
Instead, they can be defined inline as constexpr
functions and return their const values.
Respectively, define token ctors and methods
as constexpr where applicable (and noexcept while at it
where applicable)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make sure to always initalize the _data member
to 0 for non-key (minimum or maximum) tokens.
This allows to simplify the equality operator
that now doesn't need to rely on `operator<=>`
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The is_minimum/is_maximum predicates are more
efficient than comparing the the m{minimum,maximum}_token
values, respectrively. since the is_* functions
need to check only the token kind.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Token comparisons are abundant.
The equality operator is defined inline
in dht/token.hh by calling `t1 <=> t2`,
and so is `tri_compare_raw`, which `operator<=>`
calls in the common path, but `operator<=>` itself
is defined out of line, losing the benefits of inlining.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use the rolling restart to avoid spurious driver reconnects.
This can be eventually reverted once the scylladb/python-driver#295 is fixed.
Fixesscylladb/scylladb#19154Closesscylladb/scylladb#19771
* github.com:scylladb/scylladb:
test: raft: fix the flaky `test_raft_recovery_stuck`
test: raft: code cleanup in `test_raft_recovery_stuck`
In v4 of scylladb/scylladb#19598 the last commit of the patch was replaced but this change missed merge so submitting it in a separate patch.
In the current patch, the original functions class correctly marks methods as const where appropriate, and the instance() method now returns a const object. This ensures protection against accidental modifications, as all changes must go through the change_batch object.
Since the functions_changer class was intended to serve the same purpose, it is now redundant. Therefore, we are reverting the commit that introduced it.
Relates scylladb/scylladb#19153Closesscylladb/scylladb#19647
* github.com:scylladb/scylladb:
cql3: functions: replace template with std::function in with_udf_iter()
cql3: functions: improve functions class constness handling
Revert "cql3: functions: make modification functions accessible only via batch class"
Replaced the old `read_barrier` helper from "test/pylib/util.py"
by the new helper from "test/pylib/rest_client.py" that is calling
the newly introduced direct REST API.
Replaced in all relevant tests and decommissioned the old helper.
Introduced a new helper `get_host_api_address` to retrieve the host API
address - which in come cases can be different from the host address
(e.g. if the RPC address is changed).
Fixes: scylladb/scylladb#19662Closesscylladb/scylladb#19739
filesystem_storage::create_links_common() runs on directories that
generally differ from "_dir", thus, we can't replace its sync_directory()
calls with _dir.sync(). We can still use a common (temporary)
"opened_directory" object for synching "dst_dir" three times, saving two
open and two close operations.
This patch is best viewed with "git show -W".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In "filesystem_storage", change_dir_for_test() and move() replace "_dir"
with "opened_directory(new_dir)" using the move assignment operator.
Consequently, the file descriptor underlying "_dir" is closed
synchronously as a part of object destruction.
Expose the async file::close() function through "opened_directory".
Introduce filesystem_storage::change_dir() as a common async workhorse for
both change_dir_for_test() and move(). In change_dir(), close the old
directory asynchronously.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Currently change_dir_for_test() is synchronous. Make it return a future,
so that we can use async operations in change_dir_for_test() overrides.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Near the end of filesystem_storage::move(), we sync both the old
directory, and the new directory, if "delay_commit" is null. At that
point, the new directory is just "_dir"; call _dir.sync() instead of
sync_directory().
This patch is best viewed with "git show -W".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Replace
sst.sstable_write_io_check(sync_directory, _dir.native())
with
_dir.sync(sst._write_error_handler)
Also replace the explicit (but still relatively "easy")
open_checked_directory() + flush() + flush() operations in
filesystem_storage::seal() with two _dir.sync() calls.
Because filesystem_storage::create_links_common() is marked "const", we
need to declare "_dir" mutable.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
"filesystem_storage::_dir" is currently of type "std::filesystem::path".
Introduce a new class called "opened_directory", and change the type of
"_dir" to the new class "opened_directory".
"opened_directory" keeps the directory open, and offers synchronization on
that open directory (i.e., without having to reopen the directory every
time). In subsequent patches, that will be put to use.
The opening and closing of the wrapped directory cannot easily be handled
explicitly in the "filesystem_storage" member functions.
(
Namely, test::store() and test::rewrite_toc_without_scylla_component()
-- both in "test/lib/sstable_utils.hh" -- perform "open -> ... -> seal"
sequences, and such a sequence may be executed repeatedly. For example,
sstable_directory_shared_sstables_reshard_correctly()
[test/boost/sstable_directory_test.cc] does just that; it "reopens" the
"filesystem_storage" object repeatedly.
)
Rather than trying to restrict the order of "filesystem_storage" member
function calls, replace the "opened_directory" object with a new one
whenever the directory pathname is re-set; namely in
filesystem_storage::change_dir_for_test() and filesystem_storage::move().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In 6e79d64, the behavior of `manager::too_many_in_flight_hints_for()`
was accidentally modified. It remained unnoticed for some time
and then fixed. In this commit, we add a test verifying that
the concurrency of hints being written to disk is indeed limited
and the limitations are imposed properly.
Refs scylladb/scylladb#17636Fixesscylladb/scylladb#17660Closesscylladb/scylladb#19741
* github.com:scylladb/scylladb:
db/hints: Verify that Scylla limits the concurrency of written hints
db/hints: Coroutinize `hint_endpoint_manager::store_hint()`
db/hints: Move a constant value to the TU it's used in
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.
The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.
Fixes#19722
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added a new method, on_unlink() to the sstable_manager. This method is
now used by the sstable to notify the manager when it has been unlinked,
enabling the manager to update its bookkeeping as required. The
on_unlink method doesn't do anything yet but will be updated by the next
patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
in 3c7af287, cqlsh's reloc package was marked as "noarch", and its
filename was updated accordingly in `configure.py`, so let's update
the CMake building system accordingly.
this change should address the build failure of
```
08:48:14 [3325/4124] Generating ../Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz
08:48:14 FAILED: Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz /jenkins/workspace/scylla-master/scylla-ci/scylla/build/Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz
08:48:14 cd /jenkins/workspace/scylla-master/scylla-ci/scylla/build/dist && /usr/bin/cmake -E copy /jenkins/workspace/scylla-master/scylla-ci/scylla/tools/cqlsh/build/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz /jenkins/workspace/scylla-master/scylla-ci/scylla/build/Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz
08:48:14 Error copying file "/jenkins/workspace/scylla-master/scylla-ci/scylla/tools/cqlsh/build/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz" to "/jenkins/workspace/scylla-master/scylla-ci/scylla/build/Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz".
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19710
In some cases, the S3 server will not know about a certain build and
any attempt to open a coredump which was generated by this build will
fail, because the S3 server returns an empty/illegal response.
There is already a bypass for missing package-url in the S3 server
response, but this doesn't help in the case when the response is also
missing other metadata, like build-id and version info.
Extend this existig mechanism with a new --scylla-package-url flag,
which provides complete bypass. When provided, the S3 server will not be
queried at all, instead the package is downloaded from the link and
version metadata is extracted from the package itself.
Closesscylladb/scylladb#19769
* seastar 908ccd93...67065040 (44):
> metrics: Use this_shard_id unconditionally
> sstring: prevent fmt from formatting sstring as a sequence
> coding style: allow lines up to 160 chars in length
> src/core: remove unnecessary includes
> when_all: stop using deprecated std::aligned_union_t
> reactor: respect preempt requests in debug mode
> core: fix -Wunused-but-set-variable
> gate: add try_hold
> sstring: declare nested type with typename
> rpc: pass start time to `wait_for_reply()` which accepts `no_wait_type`
> scripts/perftune.py: get rid of "SyntaxWarning: invalid escape sequence"
> scripts/perftune.py: add support for tweaking VLAN interfaces
> scripts/perftune.py: improve discovery of bond device slaves
> scripts/perftune.py: refactor __learn_slaves() function
> code-cleanup: add missing header guards
> code-cleanup: remove redundant includes of 'reactor.hh'
> code-cleanup: explicitly depend on io_desc.hh
> scripts/perftune.py: aRFS should be disabled by default in non-MQ mode
> code-cleanup: remove unneeded includes of fair_queue.hh
> docker: fix mount of install-dependencies
> code-cleanup: remove redundant includes of linux-aio.hh
> fstream: reformat the doxygen comment of make_file_input_stream()
> iostream: use new-style consumer to implement copy()
> stall-analyser: use 0 for default value of --minimum
> reactor: fix crash during metrics gathering
> build: run socket test with linux-aio reactor backend
> test: Add testing of connect()-ion abort ability
> linux_perf_event: exclude_idle only on x86_64
> linux_perf_event: add make_linux_perf_event
> stall-analyser: gracefully handle empty input
> shared_token_bucket: resolve FIXME
> io_tester: ensure that file object is valid when closing it
> tutorial.md: fix typo in Dan Kegel's name
> test,rpc: Extend simple ping-pong case
> rpc: Calculate delay and export it via metrics
> rpc: Exchange handler duration with server responses
> rpc: Track handler execution time
> rpc: Fix hard-coded constants when sending unknown verb reply
> reactor: Unfriend alien and smp queues
> reactor: Add and use stopped() getter
> reactor: Generalize wakeup() callers
> file: Use lighter access to map of fs-info-s
> file: Fix indentation after previous patch
> file: Don't return chain of ready futures from make_file_impl
Closesscylladb/scylladb#19780
Pass origin when opening the sstable from the writer and store it in the
sstable object. This will make the origin available for the entire write
path.
Closesscylladb/scylladb#19721
* github.com:scylladb/scylladb:
sstables: use _origin in write path
sstable::open_sstable: pass and store origin
This PR adds support for aborting index reads from within `index_consume_entry_context::consume_input` when the server is being stopped. The abort source is now propagated down to the `index_consume_entry_context`, making it available for `consume_input` to check if an abort has been requested. If an abort is detected, `consume_input` will throw an exception to stop the index read operation.
Closesscylladb/scylladb#19453
* github.com:scylladb/scylladb:
test/boost: test abort behaviour during index read
sstables/index_reader: stop consuming index when abort has been requested
sstables::index_consume_entry_context: store abort_source
sstable: drop old filter only after the new filter is built during rebuild
sstables/sstables_manager: store abort_source in sstable_manager
replica/database: pass abort_source to database constructor
The yaml/json representation for bool is true/false, but boost::lexical_cast
is 1/0. Specialize bool conversion to accept true/false (for yaml/json
compatibilty) and 1/0 (for backward compatibility). This provides
round-trip conversion for bool configs in system.config.
Configuration uses boost::lexical_cast to convert strings to native
values (e.g. bools/ints). However, boost::lexical_cast doesn't
recognize true/false for bool. Since we can't change boost::lexical_cast,
replace it with a wrapper that forwards directly to boost::lexical_cast.
In the next step, we'll specialize it for bool.
Currently the guard does not account correctly for ongoing operation if semaphore acquisition fails. It may signal a semaphore when it is not held.
Should be backported to all supported versions.
Closesscylladb/scylladb#19699
* github.com:scylladb/scylladb:
test: add test to check that coordinator lwt semaphore continues functioning after locking failures
paxos: do not signal semaphore if it was not acquired
In 6e79d64, the behavior of `manager::too_many_in_flight_hints_for()`
was accidentally modified. It remained unnoticed for some time
and then fixed. In this commit, we add a test verifying that
the concurrency of hints being written to disk is indeed limited
and the limitations are imposed properly.
since fmt 11, it is required that the format() to be const, otherwise
its caller in fmt library would not be able to call it. and compile
would fail like:
```
/home/kefu/.local/bin/clang++ -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/abseil -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT locator/CMakeFiles/scylla_locator.dir/RelWithDebInfo/abstract_replication_strategy.cc.o -MF locator/CMakeFiles/scylla_locator.dir/RelWithDebInfo/abstract_replication_strategy.cc.o.d -o locator/CMakeFiles/scylla_locator.dir/RelWithDebInfo/abstract_replication_strategy.cc.o -c /home/kefu/dev/scylladb/locator/abstract_replication_strategy.cc
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.cc:9:
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:16:
In file included from /home/kefu/dev/scylladb/gms/inet_address.hh:11:
In file included from /usr/include/fmt/ostream.h:23:
In file included from /usr/include/fmt/chrono.h:23:
In file included from /usr/include/fmt/format.h:41:
/usr/include/fmt/base.h:1393:23: error: no matching member function for call to 'format'
1393 | ctx.advance_to(cf.format(*static_cast<qualified_type*>(arg), ctx));
| ~~~^~~~~~
/usr/include/fmt/base.h:1374:21: note: in instantiation of function template specialization 'fmt::detail::value<fmt::context>::format_custom_arg<locator::vnode_effective_replication_map::factory_key, fmt::formatter<locator::vnode_effective_replication_map::factory_key>>' requested here
1374 | custom.format = format_custom_arg<
| ^
/home/kefu/dev/scylladb/seastar/include/seastar/util/log.hh:299:33: note: in instantiation of function template specialization 'fmt::format_to<seastar::internal::log_buf::inserter_iterator &, locator::vnode_effective_replication_map::factory_key &, const void *, 0>' requested here
299 | return fmt::format_to(it, fmt.format, std::forward<Args>(args)...);
| ^
/home/kefu/dev/scylladb/seastar/include/seastar/util/log.hh:428:9: note: in instantiation of function template specialization 'seastar::logger::log<locator::vnode_effective_replication_map::factory_key &, const void *>' requested here
428 | log(log_level::debug, std::move(fmt), std::forward<Args>(args)...);
| ^
/home/kefu/dev/scylladb/locator/abstract_replication_strategy.cc:561:18: note: in instantiation of function template specialization 'seastar::logger::debug<locator::vnode_effective_replication_map::factory_key &, const void *>' requested here
561 | rslogger.debug("create_effective_replication_map: found {} [{}]", key, fmt::ptr(erm.get()));
| ^
/home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:471:10: note: candidate function template not viable: 'this' argument has type 'const fmt::formatter<locator::vnode_effective_replication_map::factory_key>', but method is not marked const
471 | auto format(const locator::vnode_effective_replication_map::factory_key& key, FormatContext& ctx) {
| ^
1 error generated.
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19768
Now that the origin is available inside the sstable object, no need to
pass it to the methods called in the write path.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Pass origin when opening the sstable from the writer and store it in the
sstable object. This will make the origin available for the entire write
path.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added a new boost test, index_reader_test, with a testcase to verifyi
the abort behaviour during an index read using
index_consume_entry_context.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
When an abort is requested, stop further reading of the index file and
throw and exception from index_consume_entry_context::process_state().
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Store abort source inside sstables::index_consume_entry_context, so that
the next patch can implement cancelling the index read when abort is
requested.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
sstable::maybe_rebuild_filter_from_index drops the existing filter first
and then rebuilds the new filter as the method is only called before the
sstable is sealed. But to make the index read abortable, the old filter
can be dropped only after the new filter is built so that in case if the
index consumer gets aborted, we still have the old filter to write to
disk.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Add a new member that stores the abort_source. This can later be used by
the sstables to check if an abort has been requested. Also implement
sstables_manager::get_abort_source() that returns a const reference to
the abort source.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
This is in preparation for the following patch that adds abort_source
variable to the sstables_manager.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
When a node that is permanently down is replaced, it is marked as "left"
but it still can be a replica of some tablets. We also don't keep IPs of
nodes that have left and the `node` structure for such node returns an
empty IP (all zeros) as the address.
This interacts badly with the view update logic. The base replica paired
with the left node might decide to generate a view update. Because
storage proxy still uses IPs and not host IDs, it needs to obtain the
view replica's IP and tell the storage proxy to write a view update to
that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to
write a hint towards this address - hinted handoff on the other hand
operates on host IDs and not IPs, so it attempts to translate the IP
back, which triggers an assertion as there is no replica with IP
0.0.0.0.
As a quick workaround for this issue just drop view updates towards
nodes which seem to have IPs that are all zeros. It would be more proper
to keep the view updates as hints and replay them later to the new
paired replica, but achieving this right now would require much more
significant changes. For now, fixing a crash is more important than
keeping views consistent with base replicas.
Fixes: scylladb/scylladb#19439
The guard signals a semaphore during destruction if it is marked as
locked, but currently it may be marked as locked even if locking failed.
Fix this by using semaphore_units instead of managing the locked flag
manually.
Fixes: https://github.com/scylladb/scylladb/issues/19698
as these workflows are scheduled periodically, and if they fail,
notifications are sent to the repo's owner. to minimize the surprises
to the contributors using github, let's disable these workflows on
fork repos.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
there's not a 1:1 relationship between compaction group count and
tablet count. a tablet replica has a storage group instance, which
may map to multiple compaction groups during split mode.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Until now, the constant `HINT_FILE_WRITE_TIMEOUT` was
declared as a static member of `db::hints::manager`.
However, the constant is only ever used in one
translation unit, so it makes more sense to move it
there and not include boilerplate in a header.
If set, any remaining segment that has data older than this threshold
will request flushing, regardless of data pressure. I.e. even a system
where nothing happends will after X seconds flush data to free up the
commit log.
2024-07-09 12:30:48 +00:00
2930 changed files with 124988 additions and 42154 deletions
parser.add_argument('--commits',default=None,type=str,help='Range of promoted commits.')
parser.add_argument('--pull-request',type=int,help='Pull request number to be backported')
parser.add_argument('--head-commit',type=str,required=is_pull_request(),help='The HEAD of target branch after the pull request specified by --pull-request is merged')
body: `❌ License header check failed. Please ensure all new files include the header within the first ${{ env.HEADER_CHECK_LINES }} lines:\n\`\`\`\n${license}\n\`\`\`\nSee action logs for details.`
if ! curl -X POST "$JENKINS_URL/job/$JOB_NAME/buildWithParameters" --fail --user "$JENKINS_USER:$JENKINS_API_TOKEN" -i -v; then
echo "Error: Jenkins job trigger failed"
# Send Slack message
curl -X POST -H 'Content-type: application/json' \
-H "Authorization: Bearer $SLACK_BOT_TOKEN" \
--data '{
"channel": "#releng-team",
"text": "🚨 @here '$JOB_NAME' failed to be triggered, please check https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }} for more details",
console.log("Looking for issues with labels:"+labelFilters+", excluding labels:"+excludingLabelFilters+ ", inactive for more than "+daysInactive+" days.");
@@ -12,7 +12,7 @@ Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to re
## Contributing code to Scylla
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form cla@scylladb.com. You can then submit your changes as patches to the to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).
The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.
The `--name` argument can be specified to run a particular test.
Alternatively, you can execute the test executable directly. For example,
@@ -267,21 +280,45 @@ Once the patch set is ready to be reviewed, push the branch to the public remote
### Development environment and source code navigation
Scylla includes a [CMake](https://cmake.org/) file, `CMakeLists.txt`, for use only with development environments (not for building) so that they can properly analyze the source code.
Scylla includes a [CMake](https://cmake.org/) file, `CMakeLists.txt` that can be used with development environments so
that they can properly analyze the source code. However, building with CMake is not yet officially supported.
[CLion](https://www.jetbrains.com/clion/) is a commercial IDE offers reasonably good source code navigation and advice for code hygiene, though its C++ parser sometimes makes errors and flags false issues.
Good IDEs that have support for CMake build toolchain are [CLion](https://www.jetbrains.com/clion/),
[KDevelop](https://www.kdevelop.org/) and [QtCreator](https://wiki.qt.io/Qt_Creator).
Other good options that directly parse CMake files are [KDevelop](https://www.kdevelop.org/) and [QtCreator](https://wiki.qt.io/Qt_Creator).
[Eclipse](https://eclipse.org/cdt/) is another open-source option. It doesn't natively work with CMake projects and its
C++ parser has many issues.
To use the `CMakeLists.txt` file with these programs, define the `FOR_IDE` CMake variable or shell environmental variable.
#### CLion
[Eclipse](https://eclipse.org/cdt/) is another open-source option. It doesn't natively work with CMake projects, and its C++ parser has many similar issues as CLion.
[CLion](https://www.jetbrains.com/clion/) is a commercial IDE offers reasonably good source code navigation and advice
for code hygiene, though its C++ parser sometimes makes errors and flags false issues. In order to enable proper code
analysis in CLion, the following steps are needed:
1. Get the ScyllaDB source code by following the [Getting the source code](#getting-the-source-code).
2. Follow the steps in [Dependencies](#dependencies) in order to install the required tools natively into your system.
**Don't** follow the *frozen toolchain* part described there, since CMake checks for the build dependencies installed
in the system, not in the container image provided by the toolchain.
3. In CLion, select `File`→`Open` and select the main ScyllaDB directory in order to open the CMake project there. The
project should open and fail to process the `CMakeLists.txt`. That's expected.
4. In CLion, open `File`→`Settings`.
5. Find and click on `Toolchains` (type *toolchains* into search box).
6. Select the toolchain you will use, for instance the `Default` one.
7. Type in the following system-installed tools to be used:
-`CMake`: *cmake*
-`Build Tool`: *ninja*
-`C Compiler`: *clang*
-`C++ Compiler`: *clang*
8. On the `CMake` panel/tab, click on `Reload CMake Project`
After that, CLion should successfully initialize the CMake project (marked by `[Finished]` in the console) and the
source code editor should provide code analysis support normally from now on.
### Distributed compilation: `distcc` and `ccache`
Scylla's compilations times can be long. Two tools help somewhat:
- [ccache](https://ccache.samba.org/) caches compiled object files on disk and re-uses them when possible
- [ccache](https://ccache.samba.org/) caches compiled object files on disk and reuses them when possible
- [distcc](https://github.com/distcc/distcc) distributes compilation jobs to remote machines
A reasonably-powered laptop acts as the coordinator for compilation. A second, more powerful, machine acts as a passive compilation server.
By utilizing or accessing the Software in any manner, You hereby confirm and agree to be bound by this ScyllaDB Software License Agreement (the "**Agreement**"), which sets forth the terms and conditions on which ScyllaDB Ltd. ("**Licensor**") makes the Software available to You, as the Licensee. If Licensee does not agree to the terms of this Agreement or cannot otherwise comply with the Agreement, Licensee shall not utilize or access the Software.
The terms "**You**" or "**Licensee**" refer to any individual accessing or using the Software under this Agreement ("**Use**"). In case that such individual is Using the Software on behalf of a legal entity, You hereby irrevocably represents and warrants that You have full legal capacity and authority to enter into this Agreement on behalf of such entity as well as bind such entity to this Agreement, and in such case, the term "You" or "Licensee" in this Agreement will refer to such entity.
**Grant of License**
* **Software Definitions:** Software means the ScyllaDB software provided by Licensor, including the source code, object code, and any accompanying documentation or tools, or any part thereof, as made available under this Agreement.
* **Grant of License:** Subject to the terms and conditions of this Agreement, Licensor grants You a limited, non-exclusive, revocable, non-sublicensable, non-transferable, royalty free license to Use the Software, in each case solely for the purposes of:
1) Copying, distributing, evaluating (including performing benchmarking or comparative tests or evaluations , subject to the limitations below) and improving the Software and ScyllaDB; and
2) create a modified version of the Software (each, a "**Licensed Work**"); provided however, that each such Licensed Work keeps all or substantially all of the functions and features of the Software, and/or using all or substantially all of the source code of the Software. You hereby agree that all the Licensed Work are, upon creation, considered Licensed Work of the Licensor, shall be the sole property of the Licensor and its assignees, and the Licensor and its assignees shall be the sole owner of all rights of any kind or nature, in connection with such Licensed Work. You hereby irrevocably and unconditionally assign to the Licensor all the Licensed Work and any part thereof. This License applies separately for each version of the Licensed Work, which shall be considered "Software" for the purpose of this Agreement.
**License Limitations, Restrictions and Obligations:** The license grant above is subject to the following limitations, restrictions, and obligations. If Licensee’s Use of the Software does not comply with the above license grant or the terms of this section (including exceeding the Usage Limit set forth below), Licensee must: (i) refrain from any Use of the Software; and (ii) purchase a [commercial paid license](https://www.scylladb.com/scylladb-proprietary-software-license-agreement/) from the Licensor.
* **Updates:** You shall be solely responsible for providing all equipment, systems, assets, access, and ancillary goods and services needed to access and Use the Software. Licensor may modify or update the Software at any time, without notification, in its sole and absolute discretion. After the effective date of each such update, Licensor shall bear no obligation to run, provide or support legacy versions of the Software.
* **"Usage Limit":** Licensee's total overall available storage across all deployments and clusters of the Software and the Licensed Work under this License shall not exceed 10TB and/or an upper limit of 50 VCPUs (hyper threads).
* **IP Markings:** Licensee must retain all copyright, trademark, and other proprietary notices contained in the Software. You will not modify, delete, alter, remove, or obscure any intellectual property, including without limitations licensing, copyright, trademark, or any other notices of Licensor in the Software.
* **License Reproduction:** You must conspicuously display this Agreement on each copy of the Software. If You receive the Software from a third party, this Agreement still applies to Your Use of the Software. You will be responsible for any breach of this Agreement by any such third-party.
* Distribution of any Licensed Works is permitted, provided that: (i) You must include in any Licensed Work prominent notices stating that You have modified the Software, (ii) You include a copy of this Agreement with the Licensed Work, and (iii) You clearly identify all modifications made in the Licensed Work and provides attribution to the Licensor as the original author(s) of the Software.
* **Commercial Use Restrictions:** Licensee may not offer the Software as a software-as-a-service (SaaS) or commercial database-as-as-service (dBaaS) offering. Licensee may not use the Software to compete with Licensor's existing or future products or services. If your Use of the Software does not comply with the requirements currently in effect as described in this License, you must purchase a commercial license from the Licensor, its affiliated entities, or you must refrain from using the Software and all Licensed Work. Furthermore, if You make any written claim of patent infringement relating to the Software, Your patent license for the Software granted under this Agreement terminates immediately.
* Notwithstanding anything to the contrary, under the License granted hereunder, You shall not and shall not permit others to: (i) transfer the Software or any portions thereof to any other party except as expressly permitted herein; (ii) attempt to circumvent or overcome any technological protection measures incorporated into the Software; (iii) incorporate the Software into the structure, machinery or controls of any aircraft, other aerial device, military vehicle, hovercraft, waterborne craft or any medical equipment of any kind; or (iv) use the Software or any part thereof in any unlawful, harmful or illegal manner, or in a manner which infringes third parties’ rights in any way, including intellectual property rights.
**Monitoring; Audit**
* **License Key:** Licensor may implement a method of authentication, e.g., a unique license token ("License Key") as a condition of accessing or using the Software. Upon the implementation of such License Key, Licensee agrees to comply with Licensor terms and requirements with regards to such License Key
* **Monitoring & Data Sharing:** Licensor do not collect customer data from its database. Notwithstanding, Licensee acknowledges and agrees that the License Key and Software may share telemetry metrics and information regarding the execution volume and statistics with Licensor regarding Licensee’s use of the same. Any disclosure or use of such information shall be subject to, and in accordance with, Licensor’s Privacy Policy and Data Processing Agreement, which can be found at [https://www.scylladb.com/policies-agreements](https://www.scylladb.com/policies-agreements).
* **Information Requests; Audits:** Licensee shall keep accurate records of its access to and use of any Software, and shall promptly respond to any Licensor requests for information regarding the same. To ensure compliance with the terms of this Agreement, during the term of this Agreement and for a period of one (1) year thereafter, Licensor (or an agent bound by customary confidentiality undertakings on its behalf) may audit Licensee’s records which are related to its access to or use of the Software. The cost of such audit shall be borne by Licensor unless it is determined that Licensee has materially breached this Agreement.
**Termination**
* **Termination:** Licensor may immediately terminate this Agreement will automatically terminate if You for any reason, including without limitation for (i) Licensee’s breach of any term, condition, or restriction of this Agreement, unless such breach was cured to Licensor’s satisfaction within no more than 15 days from the date of the breach. Notwithstanding the foregoing, intentional; or (ii) if Licensee brings any claim, demand or repeated breaches lawsuit against Licensor.
* **Obligations on Termination:** Upon termination of this Agreement by You will cause Your licenses to terminate automatically and permanently, at Licensor’s sole discretion, Licensee must (i) immediately stop using any Software, (ii) return all copies of any tools or documentation provided by Licensor; and (iii) pay amount due to Licensor hereunder (e.g., audit costs). All obligations which by their nature must survive the termination of this Agreement shall so survive.
**Indemnity; Disclaimer; Limitation of Liability**
* **Indemnity:** Licensee hereby agrees to indemnify, defend and hold harmless Licensor and its affiliates from any losses or damages incurred due to a third party claim arising out of: (i) Licensee’s breach of this Agreement; (ii) Licensee’s negligence, willful misconduct or violation of law, or (iii) Licensee’s products or services.
* DISCLAIMER OF WARRANTIES: LICENSEE AGREES THAT LICENSOR HAS MADE NO EXPRESS WARRANTIES REGARDING THE SOFTWARE AND THAT THE SOFTWARE IS BEING PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. LICENSOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THE SOFTWARE, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION, ANY IMPLIED WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE; TITLE; MERCHANTABILITY; OR NON-INFRINGEMENT OF THIRD PARTY RIGHTS. LICENSOR DOES NOT WARRANT THAT THE SOFTWARE WILL OPERATE UNINTERRUPTED OR ERROR FREE, OR THAT ALL ERRORS WILL BE CORRECTED. LICENSOR DOES NOT GUARANTEE ANY PARTICULAR RESULTS FROM THE USE OF THE SOFTWARE, AND DOES NOT WARRANT THAT THE SOFTWARE IS FIT FOR ANY PARTICULAR PURPOSE.
* LIMITATION OF LIABILITY: TO THE FULLEST EXTENT PERMISSIBLE UNDER APPLICABLE LAW, IN NO EVENT WILL LICENSOR AND/OR ITS AFFILIATES, EMPLOYEES, OFFICERS AND DIRECTORS BE LIABLE TO LICENSEE FOR (I) ANY LOSS OF USE OR DATA; INTERRUPTION OF BUSINESS; OR ANY INDIRECT; SPECIAL; INCIDENTAL; OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING LOST PROFITS); AND (II) ANY DIRECT DAMAGES EXCEEDING THE TOTAL AMOUNT OF ONE THOUSAND US DOLLARS ($1,000). THE FOREGOING PROVISIONS LIMITING THE LIABILITY OF LICENSOR SHALL APPLY REGARDLESS OF THE FORM OR CAUSE OF ACTION, WHETHER IN STRICT LIABILITY, CONTRACT OR TORT.
**Proprietary Rights; No Other Rights**
* **Ownership:** Licensor retains sole and exclusive ownership of all rights, interests and title in the Software and any scripts, processes, techniques, methodologies, inventions, know-how, concepts, formatting, arrangements, visual attributes, ideas, database rights, copyrights, patents, trade secrets, and other intellectual property related thereto, and all derivatives, enhancements, modifications and improvements thereof. Except for the limited license rights granted herein, Licensee has no rights in or to the Software and/ or Licensor’s trademarks, logo, or branding and You acknowledge that such Software, trademarks, logo, or branding is the sole property of Licensor.
* **Feedback:** Licensee is not required to provide any suggestions, enhancement requests, recommendations or other feedback regarding the Software ("Feedback"). If, notwithstanding this policy, Licensee submits Feedback, Licensee understands and acknowledges that such Feedback is not submitted in confidence and Licensor assumes no obligation, expressed or implied, by considering it. All right in any trademark or logo of Licensor or its affiliates and You shall make no claim of right to the Software or any part thereof to be supplied by Licensor hereunder and acknowledges that as between Licensor and You, such Software is the sole proprietary, title and interest in and to Licensor.such Feedback shall be assigned to, and shall become the sole and exclusive property of, Licensor upon its creation.
* Except for the rights expressly granted to You under this Agreement, You are not granted any other licenses or rights in the Software or otherwise. This Agreement constitutes the entire agreement between You and the Licensor with respect to the subject matter hereof and supersedes all prior or contemporaneous communications, representations, or agreements, whether oral or written.
* **Third-Party Software:** Customer acknowledges that the Software may contain open and closed source components (“OSS Components”) that are governed separately by certain licenses, in each case as further provided by Company upon request. Any applicable OSS Component license is solely between Licensee and the applicable licensor of the OSS Component and Licensee shall comply with the applicable OSS Component license.
* If any provision of this Agreement is held to be invalid or unenforceable, such provision shall be struck and the remaining provisions shall remain in full force and effect.
**Miscellaneous**
* **Miscellaneous:** This Agreement may be modified at any time by Licensor, and constitutes the entire agreement between the parties with respect to the subject matter hereof. Licensee may not assign or subcontract its rights or obligations under this Agreement. This Agreement does not, and shall not be construed to create any relationship, partnership, joint venture, employer-employee, agency, or franchisor-franchisee relationship between the parties.
* **Governing Law & Jurisdiction:** This Agreement shall be governed and construed in accordance with the laws of Israel, without giving effect to their respective conflicts of laws provisions, and the competent courts situated in Tel Aviv, Israel, shall have sole and exclusive jurisdiction over the parties and any conflict and/or dispute arising out of, or in connection to, this Agreement
seastar::metrics::description("number of operations via Alternator API"),{op(CamelCaseName)}).set_skip_when_empty(),
seastar::metrics::description("number of operations via Alternator API"),{op(CamelCaseName),alternator_label,basic_level}).set_skip_when_empty(),
#define OPERATION_LATENCY(name, CamelCaseName) \
seastar::metrics::make_histogram("op_latency", \
seastar::metrics::description("Latency histogram of an operation via Alternator API"),{op(CamelCaseName)},[this]{returnto_metrics_histogram(api_operations.name.histogram());}).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(), \
seastar::metrics::description("Latency histogram of an operation via Alternator API"),{op(CamelCaseName),alternator_label,basic_level},[this]{returnto_metrics_histogram(api_operations.name.histogram());}).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(), \
seastar::metrics::description("Latency summary of an operation via Alternator API"),[this]{returnto_metrics_summary(api_operations.name.summary());})(op(CamelCaseName)).set_skip_when_empty(),
seastar::metrics::description("Latency summary of an operation via Alternator API"),[this]{returnto_metrics_summary(api_operations.name.summary());})(op(CamelCaseName))(basic_level)(alternator_label).set_skip_when_empty(),
seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements")),
seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements"))(alternator_label).set_skip_when_empty(),
seastar::metrics::description("total number of consumed write units, counted as half units"),{op("DeleteItem")})(alternator_label).set_skip_when_empty(),
seastar::metrics::description("total number of consumed write units, counted as half units"),{op("UpdateItem")})(alternator_label).set_skip_when_empty(),
seastar::metrics::description("number of rows read and dropped during filtering operations")),
seastar::metrics::description("number of rows read and dropped during filtering operations"))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("batch_item_count",seastar::metrics::description("The total number of items processed across all batches"),{op("BatchWriteItem")},
seastar::metrics::make_counter("batch_item_count",seastar::metrics::description("The total number of items processed across all batches"),{op("BatchGetItem")},
seastar::metrics::description("number of table scans (counting each scan of each table that enabled expiration)")),
seastar::metrics::description("number of table scans (counting each scan of each table that enabled expiration)"))(alternator_label).set_skip_when_empty(),
seastar::metrics::description("number of token ranges scanned by this node while their primary owner was down")),
seastar::metrics::description("number of token ranges scanned by this node while their primary owner was down"))(alternator_label).set_skip_when_empty(),
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"split_output",
"description":"true if the output of the major compaction should be split in several sstables",
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/backup",
"operations":[
{
"method":"POST",
"summary":"Starts copying SSTables from a specified keyspace to a designated bucket in object storage",
"type":"string",
"nickname":"start_backup",
"produces":[
"application/json"
],
"parameters":[
{
"name":"endpoint",
"description":"ID of the configured object storage endpoint to copy sstables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"bucket",
"description":"Name of the bucket to backup sstables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"prefix",
"description":"The prefix of the objects for the backuped sstables",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"Name of a keyspace to copy sstables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Name of a table to copy sstables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"snapshot",
"description":"Name of a snapshot to copy sstables from",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"move_files",
"description":"Move component files instead of copying them",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/restore",
"operations":[
{
"method":"POST",
"summary":"Starts copying SSTables from a designated bucket in object storage to a specified keyspace",
"type":"string",
"nickname":"start_restore",
"produces":[
"application/json"
],
"parameters":[
{
"name":"endpoint",
"description":"ID of the configured object storage endpoint to copy SSTables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"bucket",
"description":"Name of the bucket to read SSTables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"prefix",
"description":"The prefix of the object keys for the backuped SSTables",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"in":"body",
"name":"sstables",
"description":"The list of the object keys of the TOC component of the SSTables to be restored",
"required":true,
"schema":{
"type":"array",
"items":{
"type":"string"
}
}
},
{
"name":"keyspace",
"description":"Name of a keyspace to copy SSTables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Name of a table to copy SSTables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"scope",
"description":"Defines the set of nodes to which mutations can be streamed",
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
@@ -1491,38 +1656,6 @@
}
]
},
{
"path":"/storage_service/truncate/{keyspace}",
"operations":[
{
"method":"POST",
"summary":"Truncates (deletes) the given columnFamily from the provided keyspace. Calling truncate results in actual deletion of all data in the cluster under the given columnFamily and it will fail unless all hosts are up. All data in the given column family will be deleted, but its definition will not be affected.",
"type":"void",
"nickname":"truncate",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"cf",
"description":"Column family name",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/keyspaces",
"operations":[
@@ -1891,6 +2024,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"force",
"description":"Enforce the source_dc option, even if it unsafe to use for rebuild",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
@@ -2696,6 +2837,70 @@
}
]
},
{
"path":"/storage_service/tablets/repair",
"operations":[
{
"nickname":"repair_tablet",
"method":"POST",
"summary":"Repair a tablet",
"type":"tablet_repair_result",
"produces":[
"application/json"
],
"parameters":[
{
"name":"ks",
"description":"Keyspace name to repair",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Table name to repair",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"tokens",
"description":"Tokens owned by the tablets to repair. Multiple tokens can be provided using a comma-separated list. When set to the special word 'all', all tablets will be repaired",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"hosts_filter",
"description":"Repair replicas listed in the comma-separated host_id list.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"dcs_filter",
"description":"Repair replicas listed in the comma-separated DC list",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"await_completion",
"description":"Set true to wait for the repair to complete. Set false to skip waiting for the repair to complete. When the option is not provided, it defaults to false.",
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.