Currently, nodetool status doesn't display information about zero-token nodes. For example, if the administrator spins up 5 nodes, of which 2 are zero-token nodes, then nodetool status only shows information about the 3 token-owning nodes.
This commit fixes the issue by leveraging the "/storage_service/host_id" API and adding the appropriate logic in scylla-nodetool.cc to support zero-token nodes.
A test is also added in nodetool/test_status.py to verify this logic. The test fails without this commit's zero-token node support, hence verifying the behavior.
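For illustration, a minimal sketch of the idea behind the fix, using Python requests against a node's REST API on the default port; the response shape shown here is an assumption, not taken from the nodetool code:
```
import requests

# Ask the REST API for the full endpoint -> host ID map; unlike token-based
# enumeration, this also covers zero-token nodes.
resp = requests.get("http://127.0.0.1:10000/storage_service/host_id")
resp.raise_for_status()
# Assumed response shape: a list of {"key": <endpoint>, "value": <host id>} entries.
nodes = {entry["key"]: entry["value"] for entry in resp.json()}
print(f"{len(nodes)} nodes known to the cluster, including zero-token ones")
```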
This PR fixes a bug, so it needs to be backported. Backporting is only needed for 6.2, since earlier versions don't support zero-token nodes.
Fixes: scylladb/scylladb#19849
Fixes: scylladb/scylladb#17857
Closes scylladb/scylladb#20909
* github.com:scylladb/scylladb:
fix nodetool status to show zero-token nodes
test: move `wait_for_first_completed` to pylib/util.py
token_metadata: rename endpoint_to_host_id_map getter and add support for joining nodes
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable, leading to data resurrection.
Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.
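A conceptual sketch of the new check (Python pseudocode, not the actual C++ read path; the memtable accessor is hypothetical):
```
def can_purge(tombstone_ts, key, memtables, max_purgeable_ts):
    """Decide whether the read path may drop an expired tombstone for `key`."""
    for mt in memtables:
        min_live_ts = mt.min_live_timestamp(key)  # hypothetical accessor
        # If a memtable still holds data the tombstone could shadow, keep it.
        if min_live_ts is not None and min_live_ts < max_purgeable_ts:
            return False
    return tombstone_ts < max_purgeable_ts
```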
Fixes #20916
`perf-simple-query` stats before and after this fix:
`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59393 insns/op, 24029 cycles/op, 0 errors)
97551.14 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59376 insns/op, 23966 cycles/op, 0 errors)
96599.92 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59367 insns/op, 23998 cycles/op, 0 errors)
97774.91 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59370 insns/op, 23968 cycles/op, 0 errors)
97796.13 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59368 insns/op, 23947 cycles/op, 0 errors)
throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19
// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59392 insns/op, 24058 cycles/op, 0 errors)
97311.48 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59375 insns/op, 24005 cycles/op, 0 errors)
98043.10 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59381 insns/op, 23941 cycles/op, 0 errors)
96750.31 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59396 insns/op, 24025 cycles/op, 0 errors)
93381.21 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59390 insns/op, 24097 cycles/op, 0 errors)
throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```
This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.
Closes scylladb/scylladb#20985
* github.com:scylladb/scylladb:
topology-custom: add test to verify tombstone gc in read path
replica/table: check memtable before discarding tombstone during read
compaction_group: track maximum timestamp across all sstables
* Add `--max-failures` flag to test.py, which stops the execution after a given number of failures
* Helps with a "fail-fast" approach and can be used to improve CI speed, especially the 100-times run
* Adds the number of cancelled tests to both the summary and the JUnit XML. I did not include them in Boost, since it does not contain any statistics.
* Removes unnecessary list creation in test.py
* A completely unrelated change, but it is small enough that I feel it can be included as part of this one. If this is an issue, I can create a separate PR for it
* Add `Test.started` property
* Helps with determining the current status of the Test and differentiating cancelled/not-started tests.
* Add `Test.failed` and `Test.did_not_run` read-only computed properties
* Helper methods to determine status, instead of using `Test.success`, which does not tell the entire story (see the sketch after this list)
* Fix `ScyllaClusterManager.stop()` method, so it doesn't fail when run multiple times
* This happens when tasks are cancelled; I'm not sure yet why. It is almost certainly unwanted behaviour, but it was already there, and with this fix it no longer causes errors
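A rough sketch of the new helpers (field names here are assumptions, not the exact test.py code):
```
from typing import Optional

class Test:
    def __init__(self) -> None:
        self.success: Optional[bool] = None  # set once the test finishes
        self._started = False                # set when execution begins

    @property
    def started(self) -> bool:
        return self._started

    @property
    def failed(self) -> bool:
        # Ran to completion and did not succeed.
        return self.started and self.success is False

    @property
    def did_not_run(self) -> bool:
        # Covers cancelled and never-started tests.
        return not self.started
```
The idea is that once `--max-failures N` tests report `failed`, the runner stops launching new tests and counts the remaining ones as cancelled (e.g. `./test.py --max-failures 5`).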
I will use backport/None for now as it is a new feature.
Fixes https://github.com/scylladb/qa-tasks/issues/1714
Closes scylladb/scylladb#21098
* github.com:scylladb/scylladb:
test.py: Add option to fail after number of failures
test.py: Add started, failed and did_not_run properties to Test
test.py: Remove unnecessary list creation
test: lib: Fix ScyllaClusterManager.stop()
When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload.
This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency.
The second issue fixed by this patch is the concurrency of view updates, which is currently unlimited. Because of that, view updates may take up so much memory that we run out of memory.
This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue.
The new semaphore has half the capacity of the regular user read concurrency semaphore and is currently used only for user writes. It's used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply), and we shouldn't have view updates in system writes or during compaction.
This patch also adds a test to confirm that the view update workload doesn't impact the read latency, as well as a test which confirms that we do not run out of memory even under heavy view update workload.
The issue of view updates causing increased latencies most often occurs in the following scenario:
* we have a medium to high write workload to a table with a materialized view which requires reading from the base table before sending the update to delete the old rows
* we have any read workload
* one replica is slower or is handling more writes due to an imbalance of data distribution
* we write with cl < ALL; the mentioned replica replies to write requests more slowly while new ones keep being sent to it.
* each write performs a read first, taking resources from the user read concurrency semaphore, so when enough writes accumulate, the reads using the semaphore start getting queued
* the queue is shared by regular reads and view update reads. When there are enough view update reads in the queue, regular reads start seeing increased latencies
An sct test (perf-regression-latency-mv-read-concurrency) was prepared to somewhat resemble this scenario:
* the tables were prepared satisfying the conditions above
* we use a medium write workload and a very low read workload
* the imbalance is achieved by writing to just a few (10) partitions - some replicas (and shards) can have twice as many or more used partitions than others. We also keep writing to a limited (though high) number of rows, to cause overwrites which require reading before sending the view update
* to minimize the test case, we use a cluster of 3 nodes and rf=2, we write with cl=ONE to have background replica writes and read with cl=ALL to wait for the slower replica to respond.
In the test above:
* without the fix, the latency of reads increases to over 50s
* with the fix, the latency of reads stays below 20ms
Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805
The patch is not that small and it isn't fixing a regression, so no backports
Closes scylladb/scylladb#20887
* github.com:scylladb/scylladb:
test: add test for high view update concurrency causing bad_allocs
test: add test for high view update concurrency degrading read latency
mv: add a dedicated read concurrency semaphore for view update read before writes
Before Python 3.12, formatted strings couldn't reuse the quote character
of the enclosing string. Change the type of quotation mark in get_cgroup
so it can be used with earlier Python versions.
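For illustration (a generic example, not the actual get_cgroup code):
```
data = {"key": "value"}

# Accepted only on Python >= 3.12, where an f-string may reuse the outer quote:
#   text = f"cgroup {data["key"]}"
# Portable spelling that also works on 3.9/3.10/3.11:
text = f"cgroup {data['key']}"
```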
Closes scylladb/scylladb#21209
When writing to some tables with materialized views, we need to read from the base
table first to perform a delete of the old view row. When doing so, the memory used
for the read is tracked by the user read concurrency semaphore. When we have a large
number of such reads, we may use up all of the semaphore units, causing the following
reads to be queued. When we have some user reads coming at the same time, these reads
can have very high latency due to the write workload on the base table. We want to avoid
this, so that the write workload doesn't have a high impact on the latency of the
read workload.
This is fixed in this patch by adding a separate read concurrency semaphore just for
view update read-before-writes. With the new semaphore, even if there are many view
update read-before-writes, they will be queued on a different semaphore than the user
reads, and they won't impact their latency.
The second issue fixed by this patch is the concurrency of view updates, which is
currently unlimited. Because of that, view updates may take up so much memory that
we run out of memory.
This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier,
but they would take much more memory while staying in the queue.
The new semaphore has half the capacity of the regular user read concurrency semaphore
and is currently used only for user writes. It's used independently of the scheduling
group on which we base the read semaphore selection, but we use a different code path
for streaming (not database::do_apply), and we shouldn't have view updates in system
writes or during compaction.
Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805
aiohttp 3.10.5 complains when 'unix+http' is used for a unix-domain
socket. Use 'http', which works with both 3.10.5 and the toolchain's 3.9.5.
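A minimal sketch of the resulting usage (endpoint path and socket path are illustrative):
```
import aiohttp

async def api_get(sock_path: str, path: str):
    # Connect over the unix-domain socket, but use a plain 'http' URL;
    # the host part is ignored when a UnixConnector is in use.
    connector = aiohttp.UnixConnector(path=sock_path)
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get(f"http://localhost{path}") as resp:
            return await resp.json()

# e.g. asyncio.run(api_get("/tmp/scylla-api.sock", "/storage_service/host_id"))
```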
Closes scylladb/scylladb#21080
Fixes #20517
Adds `aws_error`, which can contain errors parsed from the S3 response body. Adds a check for possible errors to the multipart upload completion and issues a retry if the error is retryable.
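A conceptual sketch of the check (Python, for illustration only; the real parser is C++ and the set of retryable codes is an assumption):
```
import xml.etree.ElementTree as ET

RETRYABLE_CODES = {"SlowDown", "InternalError", "RequestTimeout"}  # illustrative set

def parse_aws_error(body: str):
    """Return (code, retryable) if the body carries an S3 <Error>, else (None, False)."""
    root = ET.fromstring(body)
    if root.tag.endswith("Error"):
        code = root.findtext("Code")
        return code, code in RETRYABLE_CODES
    return None, False
```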
Closes scylladb/scylladb#20518
* github.com:scylladb/scylladb:
test: add complete_multipart_upload completion tests
code: s3 client error handling
code: add response parsing and error handling to the complete_multipart_upload
code: Introduce AWS errors parsing
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
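For example (keyspace, DC name and RF are made up; `Cluster` comes from the Python driver):
```
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
# With tablets enabled, this now produces tablet mutations for the keyspace's
# tables *and* its materialized views, so both get the new RF.
session.execute(
    "ALTER KEYSPACE ks WITH replication = "
    "{'class': 'NetworkTopologyStrategy', 'dc1': 3}"
)
```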
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.
Fixes: #20240
Closes scylladb/scylladb#21007
before this change, we enumerate the sstables tracked by the
system.sstables table, and restore them when serving
requests to "storage_service/restore" API. this works fine with
"storage_service/backup" API. but this "restore" API cannot be
used as a drop-in replacement of the rclone based API currently
used by scylla-manager.
in order to fill the gap, in this change:
* add the "prefix" parameter for specifying the shared prefix of
sstables
* add the "sstables" parameter for specifying the list of TOC
components of sstables
* remove the "snapshot" parameter, as we don't encode the prefix
on scylla's end anymore.
* make the "table" parameter mandatory.
Fixes https://github.com/scylladb/scylladb/issues/20461
----
this change is a part of the efforts to bring native backup/restore to scylla, no need to backport.
Closes scylladb/scylladb#20685
* github.com:scylladb/scylladb:
treewide: accept list of sstables in "restore" API
sstable: pass get_storage_option to sstable_directory::load_sstable()
test/nodetool: add body parameter to `expected_request`
tools/scylla-nodetool: enable nodetool to write HTTP body
During the split prepare phase, there will be more than one compaction group with
overlapping token ranges for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at the underlying group, so the compaction in
step 2 will disregard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes #20044.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
before this change, we enumerate the sstables tracked by the
system.sstables table, and restore them when serving
requests to "storage_service/restore" API. this works fine with
"storage_service/backup" API. but this "restore" API cannot be
used as a drop-in replacement of the rclone based API currently
used by scylla-manager.
in order to fill the gap, in this change:
* add the "prefix" parameter for specifying the shared prefix of
sstables
* add the "sstables" parameter for specifying the list of TOC
components of sstables
* remove the "snapshot" parameter, as we don't encode the prefix
on scylla's end anymore.
* make the "table" parameter mandatory.
Fixes scylladb/scylladb#20461
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
A primitive Python HTTP server processes S3 client requests and responds with either success or an error. The multipart uploader should fail or succeed (with or without retries) depending on the aforementioned server response.
To reduce the amount of space needed for reports, this PR modifies log
attachment in Allure so that logs are attached only for tests whose
status is other than PASSED. To simplify the solution, it is currently
not possible to switch off these logs completely.
Closes scylladb/scylladb#20786
When `property_file` is provided, we generate a
`cassandra-rackdc.properties` file, but to actually use it,
`endpoint_snitch` must be set to `GossipingPropertyFileSnitch`.
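A minimal sketch in the style of the other test examples in this log, assuming `property_file` is passed to `server_add` the same way `config` is; the dc/rack values are made up:
```
config = {'endpoint_snitch': 'GossipingPropertyFileSnitch'}
property_file = {'dc': 'dc1', 'rack': 'rack1'}
server = await manager.server_add(config=config, property_file=property_file)
```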
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#20730
with this parameter, the "backup" API can back up a given table, which
enables it to be a drop-in replacement of the existing rclone API used by
scylla manager.
in this change:
* api/storage_service: add "table" parameter to "backup" API.
* snapshot_ctl: compose the full path of the snapshot directory in
`snapshot_ctl::start_backup`. since we have all the information
for composing the snapshot directory, and what the `backup_task_impl`
class is interested is but the snapshot directory, we just pass
the path to it instead the individual components of the directory.
* backup_task_impl: instead of scan the whole keyspace recursively,
only scan the specified snapshot directory.
Fixes scylladb/scylladb#20636
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Change the order of operations: first remount, then change ownership of the
cgroup. It was not failing before because in privileged mode cgroups are
mounted as RW, but it's better to have this check in case the behavior
changes.
Closes scylladb/scylladb#20676
For the benefit of running test.py inside CI, we recently added to
test/cql-pytest and test/alternator the knowledge of which "Scylla mode"
(--mode) and "run number" is running (--run_id), although these concepts
are alien to these two test frameworks (remember that those test frameworks
can also run tests against unknown versions of Scylla or even our competitors'
implementations).
One unfortunate result of this change is that now if you run a test by
using pytest directly (or test/*/run) instead of test.py, for example:
$ cd test/alternator
$ pytest --aws test_item.py::test_basic_string_put_and_get
The test's success or failure reports the ugly name
test_item.py::test_basic_string_put_and_get.no_mode.1
This unnecessary "no_mode.1" come from the the default values for --mode
and --run_id, respectively. But there is no reason for these silly
defaults. In this patch we change these defaults to None, and when they
are None, they aren't tacked onto the test's name.
This patch shouldn't affect running tests through test.py, because
test.py always sets the --mode and --run_id options, and doesn't leave
them as the default.
Fixes #20512
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#20513
This PR adds the possibility to gather resource consumption metrics. The collected metrics can be used to compare performance before and after specific changes aimed at increasing performance. Currently, this functionality works only in manual mode, and this is just raw data. Later on, these metrics can be used in a Jupyter notebook to analyze and visualize how the resources are used and can provide insight into how to improve it. This PR is a first pass at gathering these metrics.
Add the possibility to gather resource consumption for the test.py execution. A SQLite DB will be created with different performance metrics that allow comparing resource consumption between changes.
The DB will be in the tmp directory, which by default is set to testlog. Across runs, the DB will not be deleted, so each new run just adds information to the existing DB.
The --get-metrics parameter was added to switch metrics gathering on or off. By default, it's switched on.
Closes: scylladb/qa-tasks#1666
Closes: scylladb/qa-tasks#1707
Closes scylladb/scylladb#19881
Migrate the `system_distributed.view_build_status` table to `system.view_build_status_v2`. The writes to the v2 table are done via raft group0 operations.
The new parameter `view_builder_version` stored in `scylla_local` indicates whether nodes should use the old or the new table.
New clusters use v2. Otherwise, the migration to v2 is initiated by the topology coordinator when the feature is enabled. It reads all the rows from the old table and writes them to the new table, and sets `view_builder_version` to v2. When the change is applied, all view_builder services are updated to write and read from the v2 table.
The old table `system_distributed.view_build_status` is set to read virtually from the new table in order to maintain compatibility.
When removing a node from the cluster, we remove its rows from the table atomically (fixes https://github.com/scylladb/scylladb/issues/11836). Also, during the migration, we remove all invalid rows.
Fixes scylladb/scylladb#15329
dtest https://github.com/scylladb/scylla-dtest/pull/4827
Closes scylladb/scylladb#19745
* github.com:scylladb/scylladb:
view: test view_build_status table with node replace
test/pylib: use view_build_status_v2 table in wait_for_view
view_builder: common write view_build_status function
view_builder: improve migration to v2 with intermediate phase
view: delete node rows from view_build_status on node removal
view: sanitize view_build_status during migration
view: make old view_build_status table a virtual table
replica: move streaming_reader_lifecycle_policy to header file
view_builder: test view_build_status_v2
storage_service: add view_build_status to raft snapshot
view_builder: migration to v2
db:system_keyspace: add view_builder_version to scylla_local
view_builder: read view status from v2 table
view_builder: introduce writing status mutations via raft
view_builder: pass group0_client and qp to view_builder
view_builder: extract sys_dist status operations to functions
db:system_keyspace: add view_build_status_v2 table
In test.py every asyncio task spawned during a test must be finished before the next test; otherwise, tests might affect each other's results.
The developers are responsible for writing asyncio code in a way that doesn’t leave task objects unfinished.
Test.py has a mechanism that helps test writers avoid such tasks. At the end of each test case, it verifies that the test did not produce/leave any tasks and sets an event object that fails the next test at the start if this is the case (issue https://github.com/scylladb/scylladb/issues/16472).
The problem with this was that breaking the next test was counterintuitive, and the logging for this situation was insufficient and non-obvious.
notes: Task.cancel() is not an option to avoid task leakage (a short illustration follows)
1) Calling cancel() does not cancel the task: the cancel() method just requests that the target task cancel itself.
2) Calling cancel() does not block until the task is cancelled: if the caller needs to know that the task is cancelled and done, it should await the target task.
3) In this particular PR, task.cancel() cancels the task on the client (ManagerClient) but not on the HTTP server (ScyllaManager), so "await" is needed.
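A short illustration of the point above:
```
import asyncio

async def stop_task(task: asyncio.Task) -> None:
    task.cancel()        # only *requests* cancellation
    try:
        await task       # actually wait until the task is cancelled and done
    except asyncio.CancelledError:
        pass             # expected once the cancellation lands
```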
Closes scylladb/scylladb#20012
Any expired tombstone can be garbage collected if it doesn't shadow data in the commit log, memtable, or uncompacting SSTables.
This PR introduces a new mode to major compaction, enabled by the `consider_only_existing_data` flag that bypasses these checks. When enabled, memtables and old commitlog segments are cleared with a system-wide flush and all the sstables (after flush) are included in the compaction, so that it works with all data generated up to a given time point.
This new mode works with the assumption that newly written data will not be shadowed by expired tombstones. So it ignores new sstables (and new data written to memtable) created after compaction started. Since there was a system wide flush, commitlog checks can also be skipped when garbage collecting tombstones. Introducing data shadowed by a tombstone during compaction can lead to undefined behavior, even without this PR, as the tombstone may or may not have already been garbage collected.
Fixes #19728
Closes scylladb/scylladb#20031
* github.com:scylladb/scylladb:
cql-pytest: add test to verify consider_only_existing_data compaction option
tools/scylla-nodetool: add consider-only-existing-data option to compact command
api: compaction: add `consider_only_existing_data` option
compaction: consider gc_check_only_compacting_sstables when deducing max purgeable timestamp
compaction: do not check commitlog if gc_check_only_compacting_sstables is enabled
tombstone_gc_state: introduce with_commitlog_check_disabled()
compaction: introduce new option to check only compacting sstables for gc
compaction: rename maybe_flush_all_tables to maybe_flush_commitlog
compaction: maybe_flush_all_tables: add new force_flush param
To enhance the test reports UX:
1. switching passed/failed/skipped tests on/off for better visibility
2. better searching in test results
3. understanding the trends of execution for each test
4. better configurability of the final report
Enable allure adapter for all python tests.
Add tags and parameters to the test to be able to distinguish them across modes and runs.
Related: https://github.com/scylladb/qa-tasks/issues/1665
Related: https://github.com/scylladb/scylladb/pull/19335
Related: https://github.com/scylladb/scylladb/pull/18169
Closes scylladb/scylladb#19942
* github.com:scylladb/scylladb:
[test.py] Clean duplicated arg for test suite
[test.py] Enable allure for python test
Allow server_add() to return when a server reaches a specified state.
One of:
- PROCESS_STARTED
- HOST_ID_QUERIED (previously called NOT_CONNECTED)
- CQL_CONNECTED (renamed from CONNECTED)
- CQL_QUERIED (was just QUERIED)
Also, rename CqlUpState to ServerUpState and move to internal_types.
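A sketch of the intended usage (the keyword name and import path are assumptions; the states are the ones listed above):
```
from test.pylib.internal_types import ServerUpState  # import path assumed

# Return as soon as the Scylla process is up, without waiting for CQL.
server = await manager.server_add(
    expected_server_up_state=ServerUpState.PROCESS_STARTED)
```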
Change the util function wait_for_view to read the view build status
from the system.view_build_status_v2 table which replaces
system_distributed.view_build_status.
The old table can still be used but it is less efficient because it's
implemented as a virtual table which reads from the v2 table, so it's
better to read directly from the v2 table. This can cause slowness in
tests.
The additional util function wait_for_view_v1 reads from the old table.
This may be needed in upgrade tests if the v2 table is not available
yet.
Currently if a coordinator and a node being replaced are in the same DC
while inter-dc encryption is enabled (connections between nodes in the
same DC should not be encrypted) the replace operation will fail. It
fails because a coordinator uses non encrypted connection to push raft
data to the new node, but the new node will not accept such connection
until it knows which DC the coordinator belongs to and for that the raft
data needs to be transferred.
The series adds the test for this scenario and the fix for the
chicken&egg problem above.
The series (or at least the fix itself) needs to be backported because
this is a serious regression.
Fixes: scylladb/scylladb#19025
Closes scylladb/scylladb#20290
* github.com:scylladb/scylladb:
topology coordinator: fix indentation after the last patch
topology coordinator: do not add replacing node without a ring to topology
test: add test for replace in clusters with encryption enabled
test.py: add server encryption support to cluster manager
.gitignore: fix pattern for resources to match only one specific directory
The method starts a task that uses sstables_loader load-and-stream
functionality to bring new sstables into the cluster. The existing
load-and-stream picks up sstables from the upload/ directory; the newly
introduced task collects them from an S3 bucket and a given prefix (which
corresponds to the path where the backup API method put them).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This adds minimal implementation of the start-backup API call.
The method starts a task that uploads all files from the given keyspace's snapshot to the requested endpoint/bucket. Arguments are:
- endpoint -- the ID in object_store.yaml config file
- bucket -- the target bucket to put objects into
- keyspace -- the keyspace to work on
- snapshot -- the method assumes that the snapshot had been already taken and only copies sstables from it
The task runs in the background, its task_id is returned from the method once it's spawned and it should be used via /task_manager API to track the task execution and completion (hint: it's good to have non-zero TTL value to make sure fast backups don't finish before the caller manages to call wait_task API).
Sstables components are scanned for all tables in the keyspace and are uploaded into the /bucket/${cf_name}/${snapshot_name}/ path.
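A rough sketch of driving the API from a client (passing the arguments as query strings is an assumption of this example; all values are made up):
```
import requests

api = "http://127.0.0.1:10000"
resp = requests.post(f"{api}/storage_service/backup", params={
    "endpoint": "s3.us-east-2",   # ID from object_store.yaml
    "bucket": "my-backups",
    "keyspace": "ks",
    "snapshot": "pre-upgrade",    # must already exist
})
resp.raise_for_status()
task_id = resp.json()
# Track the background task through the task manager API.
status = requests.get(f"{api}/task_manager/wait_task/{task_id}").json()
print(status)
```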
refs: #18391
Closes scylladb/scylladb#19890
* github.com:scylladb/scylladb:
tools/scylla-nodetool: add backup integration
docs: Document the new backup method
test/object_store: Test that backup task is abortable
test/object_store: Add simple backup test
test/object_store: Move format_tuples()
test/pylib: Add more methods to rest client
backup-task: Make it abortable (almost)
code: Introduce backup API method
database: Export parse_table_directory_name() helper
database: Introduce format_table_directory_name() helper
snapshot-ctl: Add config to snapshot_ctl
snapshot-ctl: Add sstables::storage_manager dependency
snapshot-ctl: Maintain task manager module
snapshot-ctl: Add "snapshots" logger
snapshot-ctl: Outline stop() method and constructor
snapshot-ctl: Inline run_snapshot_list<>
test/cql_test_env: Export task manager from cql test env
task_manager: Print task ttl on start (for debugging)
docs: Update object_storage.md with AWS_ environment
docs: Restructure object_storage.md
The pool size increase was recently reverted because of flakiness in the test_gossip_boot test. The test started
to fail when adding a node to the cluster, without any issues in the Scylla log file. In the test logs it looked like
the installation process for the new node just hung. After investigating the problem, I found out that test.py was
draining the io_executor pool (set to eight workers) to clean the directory during install. So to fix the issue,
the io_executor pool should be increased to roughly the same ratio as before: double the cluster pool size.
Closes scylladb/scylladb#20276
The `keyspace_compaction` method incorrectly appends the column family
parameter to the URL using a regular string, `"?cf={table}"`, instead of
an f-string, `f"?cf={table}"`. As a result, the column family name is
sent as `{table}` to the server, causing the compaction request to fail.
Fix this issue by passing the parameter to the POST request using a
dictionary instead of appending it to the URL.
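A minimal sketch of the before/after (shown with the requests library for illustration; the real client code differs):
```
from typing import Optional
import requests

def keyspace_compaction(api: str, keyspace: str, table: Optional[str] = None):
    # Before: the literal text "?cf={table}" was appended, because the string
    # was missing its f-prefix, so the server saw a column family named "{table}".
    # After: hand the parameter to the library and let it build the query string.
    params = {"cf": table} if table else None
    return requests.post(f"{api}/storage_service/keyspace_compaction/{keyspace}",
                         params=params)
```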
Fixes #20264
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#20243
Namely:
- POST /storage_service/snapshots to take snapshot on a ks
- GET /task_manager/get_task_status/{id} to get status of a running task
- GET /task_manager/wait_task/{id} to wait for a task to finish
- POST /task_manager/abort_task/{id} to abort a running task
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method starts a task that uploads all files from the given
keyspace's snapshot to the requested endpoint/bucket. The task runs in
the background, its task_id is returned from the method once it's
spawned and it should be used via /task_manager API to track the task
execution and completion (hint: it's good to have non-zero TTL value to
make sure fast backups don't finish before the caller manages to call
wait_task API).
If the snapshot doesn't exist, nothing happens (FIXME, need to return
an error in that case).
If endpoint is not configured locally, the API call resolves with
bad-request instantly.
Sstables components are scanned for all tables in the keyspace and are
uploaded into the /bucket/${cf_name}/${snapshot_name}/ path.
Task is not abortable (FIXME -- to be added) and doesn't really report
its progress other than running/done state (FIXME -- to be added too).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The manager.driver_connect() function allows passing parameters when
creating the connection (e.g., a special auth_provider), but unfortunately
right now the servers_add() function always calls driver_connect()
without parameters. So in this patch we just add a new optional
parameter to servers_add(), driver_connect_opts, that will be passed to
driver_connect().
In theory, instead of the new option, a caller could pass start=False
to servers_add() and later call driver_connect() manually with the right
arguments. The problem is that start=False
avoids more than just calling driver_connect(), so it doesn't solve
the problem.
An example of using the new option is to run Scylla with authentication
enabled, and then connect to it using the correct default account
("cassandra"/"cassandra"):
config = {
'authenticator': 'PasswordAuthenticator',
'authorizer': 'CassandraAuthorizer'
}
servers = await manager.servers_add(1, config=config,
driver_connect_opts={'auth_provider':
PlainTextAuthProvider(username='cassandra', password='cassandra')})
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add more logging for raft-based topology operations in INFO and DEBUG
levels.
Improve the existing logging, adding more details.
Fix a FIXME in test_coordinator_queue_management (by re-adding a log
message that was removed in the past -- probably by accident -- and
properly waiting for it to appear in the test).
Enable group0_state_machine logging at TRACE level in tests. These logs
are relatively rare (group 0 commands are used for metadata operations)
and relatively small, mostly consist of printing `system.group0_history`
mutation in the applied command, for example:
```
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - apply() is called with 1 commands
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd: prev_state_id: optional(dd9d47c6-50ee-11ef-d77f-500b8e1edde3), new_state_id: dd9ea5c6-50ee-11ef-ae64-dfbcd08d72c3, creator_addr: 127.219.233.1, creator_id: 02679305-b9d1-41ef-866d-d69be156c981
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd.history_append: {canonical_mutation: table_id 027e42f5-683a-3ed7-b404-a0100762063c schema_version c9c345e1-428f-36e0-b7d5-9af5f985021e partition_key pk{0007686973746f7279} partition_tombstone {tombstone: none}, row tombstone {range_tombstone: start={position: clustered, ckp{0010b4ba65c64b6e11ef8080808080808080}, 1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=1722617232237511, deletion_time=1722617232}}{row {position: clustered, ckp{0010dd9ea5c650ee11efae64dfbcd08d72c3}, 0} tombstone {row_tombstone: none} marker {row_marker: 1722617232237511 0 0}, column description atomic_cell{ create system_distributed keyspace; create system_distributed_everywhere keyspace; create and update system_distributed(_everywhere) tables,ts=1722617232237511,expiry=-1,ttl=0}}}
```
note that the mutation contains a human-readable description of the
command -- like "create system_distributed keyspace" above.
These logs might help debugging various issues (e.g. when `apply` hangs
waiting for read_apply mutex, or takes too long to apply a command).
Ref: scylladb/scylladb#19105
Ref: scylladb/scylladb#19945
Closes scylladb/scylladb#19998
Introduce virtual tasks - task manager tasks which cover
cluster-wide operations.
Virtual tasks aren't kept in memory, instead their statuses
are retrieved from associated service when user requests
them with task manager API. From API users' perspective,
virtual tasks behave similarly to regular tasks, but they can
be queried from any node in a cluster.
Virtual tasks cannot have a parent task. They can have
children on each node in a cluster, but do not keep references
to them. So, if a direct child of a virtual task is unregistered
from task manager, it will no longer be shown in parent's
children vector.
The virtual_task class corresponds to all virtual tasks in one
group. If users want to list all tasks in a module, a virtual_task
returns all recent supported operations; if they request a virtual
task's status, info about the one specified operation is
presented. Time to live, the number of tracked operations, etc.
depend on the implementation of individual virtual_task.
All virtual_tasks are kept only on shard 0.
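For example, since virtual tasks are cluster-wide, the same task ID can be queried through the task manager API of any node (addresses, port and task ID are made up):
```
import requests

task_id = "6f9619ff-8b86-d011-b42d-00cf4fc964ff"  # made-up virtual task ID
for node in ("127.0.0.1", "127.0.0.2", "127.0.0.3"):
    url = f"http://{node}:10000/task_manager/get_task_status/{task_id}"
    print(node, requests.get(url).json())
```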
Refs: https://github.com/scylladb/scylladb/issues/15852
New feature, no backport needed.
Closes scylladb/scylladb#16374
* github.com:scylladb/scylladb:
docs: describe virtual tasks
db: node_ops: filter topology request entries
test: add a topology suite for testing tasks
node_ops: service: create streaming tasks
node_ops: register node_ops_virtual_task in task manager
service: node_ops: keep node ops module in storage service
node_ops: implement node_ops_virtual_task methods
db: service: modify methods to get topology_requests data
db: service: add request type column to topology_requests
node_ops: add task manager module and node_ops_virtual_task
tasks: api: add virtual task support to get_task_status_recursively
tasks: api: add virtual task support
tasks: api: add virtual tasks support to get_tasks
tasks: add task_handler to hide task and virtual_task differences from user
tasks: modify invoke_on_task
tasks: implement task_manager::virtual_task::impl::get_children
tasks: keep virtual tasks in task manager
tasks: introduce task_manager::virtual_task
We make two changes:
- we lease the IP address of a node that failed to boot because of
an expected error,
- we don't log "Cluster ... added ..." when a node fails to boot
because of an expected error.
Here are some examples of tests that don't work with no initial
nodes, but they should work:
1.
```
await manager.server_add(expected_error="...")
await manager.server_add()
```
2.
```
await manager.servers_add(2, expected_error="...")
await manager.servers_add(2)
```
3.
```
s1 = await manager.server_add(start=False)
await manager.server_start(s1.server_id, expected_error="...")
await manager.server_add()
```
4.
```
[s1, s2] = await manager.servers_add(2, start=False)
await manager.server_start(s1.server_id, expected_error="...")
await manager.server_start(s2.server_id, expected_error="...")
await manager.servers_add(2)
```
5.
```
s1 = await manager.server_add(start=False)
await manager.server_add()
await manager.server_start(s1.server_id)
```
6.
```
[s1, s2] = await manager.servers_add(2, start=False)
await manager.servers_add(2)
await manager.server_start(s1.server_id)
await manager.server_start(s2.server_id)
```
In this patch, we make a few improvements to make tests like the ones
presented above work. I tested all the examples above manually.
From now on, servers receive correct seeds if the first servers added
in the test didn't start or failed to boot.
Also, we remove the assertion preventing the creation of a second
cluster. This assertion failed the tests presented above. We could
weaken it to make these tests pass, but it would require some work.
Moreover, we have tests that intentionally create two clusters.
Therefore, we go for the easiest solution and accept that a single
`ScyllaCluster` may not correspond to a single Scylla cluster.
We change seeds in `ScyllaCluster.server_start` to all currently
running nodes. The previous code only pretended that it did it.
After doing this change, writing tests that create multiple clusters
is impossible. To allow it, we add the `seeds` parameter to
`ManagerClient.server_start`. We use it to fix and simplify the only
test that creates two clusters - `test_different_group0_ids`.
Add topology_tasks test suite for testing task manager's node ops
tasks. Add TaskManagerClient to topology_tasks for easy usage
of the task manager REST API.
Write a test for bootstrap, replace, rebuild, decommission and remove
top level tasks using the above.