The pair of column_family/metrics/(total|live)_disk_space_used/{name}
reports the disk usage by sstables. The test creates a table, populates
and flushes it, and checks that the size corresponds to what stat(2)
reports for the respective files.
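A minimal sketch of the comparison side of such a check, assuming the
sstable component files are already known as a list of paths (the file
names below are throwaway stand-ins, not real sstable names):

```python
import os
import tempfile


def disk_space_used(paths):
    """Sum the size in bytes, as reported by stat(2), of the given files."""
    return sum(os.stat(p).st_size for p in paths)


# Throwaway files standing in for sstable components (hypothetical names).
with tempfile.TemporaryDirectory() as d:
    paths = []
    for i, size in enumerate([10, 250, 4096]):
        path = os.path.join(d, f"component-{i}.db")
        with open(path, "wb") as f:
            f.write(b"\0" * size)
        paths.append(path)
    total_on_disk = disk_space_used(paths)
```

The value computed this way is what the metric returned by the endpoint
would be compared against.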
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This will result in new tables having at least 10 tablet replicas per
shard by default.
We want this to reduce tablet load imbalance due to differences in
tablet count per shard, where some shards have 1 tablet and some
shards have 2 tablets. With higher tablet count per shard, this
difference-by-one is less relevant.
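The effect of the difference-by-one can be illustrated with a little
arithmetic. This is a sketch that assumes all tablets carry equal load,
which is a simplification:

```python
def relative_imbalance(min_tablets_per_shard: int) -> float:
    """Load ratio between a shard with one extra tablet and a shard
    holding the minimum, assuming equally sized tablets."""
    return (min_tablets_per_shard + 1) / min_tablets_per_shard


# 1 vs 2 tablets: one shard carries twice the load of the other.
worst_case = relative_imbalance(1)
# 10 vs 11 tablets: the extra tablet adds only 10% more load.
with_ten = relative_imbalance(10)
```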
Fixes #21967
In some tests, we explicitly set the initial scale to 1 as some of the
existing tests assume 1 compaction group per shard.
test.py uses a lower default. Having many tablets per shard slows down
certain topology operations like decommission/replace/removenode,
where the running time is proportional to tablet count, not data size,
because constant cost (latency) of migration dominates. This latency
is due to group0 operations and barriers. This is especially
pronounced in debug mode. Scheduler allows at most 2 migrations per
shard, so this latency becomes a determining factor for decommission
speed.
To avoid this problem in tests, we use a lower default for tablet count per
shard, 2 in debug/dev mode and 4 in release mode. Alternatively, we
could compensate by allowing more concurrency when migrating small
tablets, but there's no infrastructure for that yet.
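A rough model of why tablet count, not data size, determines the
running time when migration latency dominates. The per-migration
latency below is a made-up number for illustration; only the limit of
2 concurrent migrations per shard comes from the description above:

```python
import math


def estimated_drain_seconds(tablets_on_shard, per_migration_latency_s,
                            max_concurrent=2):
    """Time to migrate all tablets off a shard when each migration costs
    a fixed latency and at most `max_concurrent` run at once."""
    rounds = math.ceil(tablets_on_shard / max_concurrent)
    return rounds * per_migration_latency_s


# Hypothetical fixed latency of 3 seconds per migration.
with_ten_tablets = estimated_drain_seconds(10, 3.0)
with_two_tablets = estimated_drain_seconds(2, 3.0)
```

In this model, going from 2 to 10 tablets per shard multiplies the
drain time by five regardless of how little data each tablet holds.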
I observed that with 10 tablets per shard, debug-mode
topology_custom.mv/test_mv_topology_change starts to time out during
removenode (30 s).
Currently, /task_manager/task_status_recursive/{task_id} and
/task_manager/task_status/{task_id} unregister the queried task if it
has already finished.
The status should not disappear after being queried. Do not unregister
finished task when its status or recursive status is queried.
In the following patches, get_status won't be unregistering finished
tasks. However, tests need a way to drop tasks, so that they operate
only on the tasks of the operations that they invoked themselves.
Add /task_manager/drain/{module} to unregister all finished tasks
from the module. Add respective nodetool command.
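The intended semantics can be sketched with a toy in-memory registry.
This is an illustrative model only; the class and method names are
hypothetical and do not mirror the task manager's actual code:

```python
class ToyTaskRegistry:
    """Toy model: status queries keep tasks, drain removes finished ones."""

    def __init__(self):
        self._tasks = {}  # task_id -> (module, state)

    def register(self, task_id, module):
        self._tasks[task_id] = (module, "running")

    def finish(self, task_id):
        module, _ = self._tasks[task_id]
        self._tasks[task_id] = (module, "done")

    def get_status(self, task_id):
        # Querying the status does not unregister the task.
        return self._tasks[task_id][1]

    def drain(self, module):
        """Unregister all finished tasks of `module`; return how many."""
        finished = [tid for tid, (m, state) in self._tasks.items()
                    if m == module and state != "running"]
        for tid in finished:
            del self._tasks[tid]
        return len(finished)


registry = ToyTaskRegistry()
registry.register("t1", "compaction")
registry.register("t2", "compaction")
registry.finish("t1")
first = registry.get_status("t1")
second = registry.get_status("t1")  # still available after the first query
drained = registry.drain("compaction")  # removes t1 only, t2 is running
```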
Adds core integration of the audit subsystem into Scylla's main process flow. Changes include:
- Import audit subsystem header
- Initialize audit system during server startup using configuration and token metadata
- Start audit system after API server initialization with query processor and memory manager
- Add proper shutdown sequence for audit system using RAII pattern
- Add error handling for audit system initialization failures
The audit system is now properly integrated into Scylla's lifecycle, ensuring:
- Correct initialization order relative to other subsystems
- Proper resource cleanup during shutdown
- Graceful error handling for initialization failures
In Scylla there are two options that control IO bandwidth limit -- the /storage_service/(compaction|stream)_throughput REST API endpoints. The endpoints are partially implemented and have no counterparts in the nodetool.
This set implements the missing bits and adds tests for new functionality.
Closes scylladb/scylladb#21877
* github.com:scylladb/scylladb:
nodetool: Implement [gs]etstreamthroughput commands
nodetool: Implement [gs]etcompationthroughput commands
test: Add validation of how IO-updating endpoints work
api: Implement /storage_service/(stream|compaction)_throughput endpoints
api: Disqualify const config reference
api: Implement /storage_service/stream_throughput endpoint
api: Move stream throughput set/get endpoints from storage service block
api: Move set_compaction_throughput_mb_per_sec to config block
util: Include fmt/ranges.h in config_file.hh
We still have a number of issues to be solved for views with tablets.
Until they are fixed, we should prevent users from creating them,
and use the vnode-based views instead.
This patch prepares the feature for enabling views with tablets. The
feature is disabled by default, but currently it has no effect.
After all tests are adjusted to use the feature, we should depend
on the feature for deciding whether we can create materialized views
in tablet-enabled keyspaces.
The unit tests are adjusted to enable this feature explicitly, and it's
also added to the scylla sstable tool config - this tool treats all
tables as if they were tablet-based (surprisingly, with SimpleStrategy),
so for it to work on views, the new feature must be enabled.
Refs scylladb/scylladb#21832
Closes scylladb/scylladb#21833
The /column_family/compaction_strategy has GET and POST implemented, the
latter changes the strategy on the table.
Unknown strategy name implicitly renders internal server error code by
catching exception from compaction_strategy::type() that tries to
convert strategy name string to strategy enum class type.
This is to finish validation of #21533
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#21569
There are now four of those and these are all the same in the way they
interpret the value parameter (though it's named differently)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When users start an operation asynchronously with API, they are expected to check the operation's status. Hence, the status should be kept in task manager for reasonable time after the operation is done. The operations that are started internally usually don't need to stay in task manager for that long.
Add api_task_ttl that will be used for tasks started with API. By default it's 1 hour. The time for which non-API tasks stay in task manager isn't changed.
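A minimal sketch of the retention rule, assuming TTLs are expressed in
seconds. The function and parameter names are hypothetical; only the
one-hour default comes from the description above:

```python
DEFAULT_API_TASK_TTL_S = 3600  # one hour, the default stated above


def retention_seconds(started_via_api, task_ttl_s,
                      api_task_ttl_s=DEFAULT_API_TASK_TTL_S):
    """Pick the TTL that applies to a finished task."""
    return api_task_ttl_s if started_via_api else task_ttl_s


def is_expired(finished_at_s, now_s, ttl_s):
    """A finished task may be dropped once its TTL has elapsed."""
    return now_s - finished_at_s >= ttl_s


api_ttl = retention_seconds(True, task_ttl_s=10)
internal_ttl = retention_seconds(False, task_ttl_s=10)
```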
Fixes: #21499.
Refs: #21425.
No backport needed - previous versions may use task_ttl
Closes scylladb/scylladb#21505
* github.com:scylladb/scylladb:
test: add test to check user_task_ttl
tasks: api: move make_task method
docs: nodetool: update backup and restore commands docs
docs: update task manager docs
nodetool: add nodetool tasks user-ttl command
node_ops: use user task ttl for node ops virtual task
tasks: use user_task_ttl for tasks started by user
api: task_manager: add /task_manager/user_ttl to get and set user task ttl
tasks: add task_manager::task::is_user_task method
tasks: keep updateable_value of task_ttl in task manager
db: config: add user_task_ttl_seconds named value
Stop taking snapshots of MVs and allow taking a snapshot of individual tables: now one can take a snapshot of any base table, any view or index. Also add tests to cover the new cases, both a boost test (using cc code) and a pytest (using the API).
Also, update documentation to reflect the change.
fixes: #21339
fixes: #20760
Closes scylladb/scylladb#21433
Python and Python developers don't like directory names to include a
minus sign, like "cql-pytest". In this patch we rename test/cql-pytest
to test/cqlpy, and also change a few references in other code (e.g., code
that used test/cql-pytest/run.py) and also references to this test suite
in documentation and comments.
Arguably, the word "test" was always redundant in test/cql-pytest, and
I want to leave the "py" in test/cqlpy to emphasize that it's Python-based
tests, contrasting with test/cql which are CQL-request-only approval
tests.
Fixes #20846
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
For a table with NullCompactionStrategy and
TimeWindowCompactionStrategy, the test
- inserts a bunch of data and flushes the table
- deletes/updates some data, deletes a range of data and flushes
the table
- triggers a major compaction and calls compactionhistory
to retrieve and validate the histogram
This option was silently broken when the default of --enable-tablets
changed from false to true. The reason is that when --vnodes is passed,
run only removes --enable-tablets=true from scylla's command line. With
the new default this is not enough: we need to explicitly disable
tablets to override the default.
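The fix can be sketched as a small argument-list transformation. This
is an illustration, not the run script's actual code; the function name
is hypothetical and it assumes the option is always passed in the
`--enable-tablets=<value>` form:

```python
def apply_vnodes_option(scylla_args):
    """Return a copy of the argument list with tablets explicitly disabled."""
    # Dropping the option alone is not enough once the default is true...
    args = [a for a in scylla_args if not a.startswith("--enable-tablets")]
    # ...so the default has to be overridden explicitly.
    args.append("--enable-tablets=false")
    return args


cmdline = apply_vnodes_option(["--smp", "2", "--enable-tablets=true"])
```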
Closes scylladb/scylladb#20462
The pool size increase was recently reverted because it made the test_gossip_boot test flaky. The test started
to fail when adding a node to the cluster, without any issues in the Scylla log file. In the test logs it looked like the
installation process for the new node just hung. After investigating the problem, I found that
test.py was using up the io_executor pool, which was set to eight workers, for cleaning the directory during install. So
to fix the issue, the io_executor pool should be increased to keep roughly the same ratio as before: double the cluster pool size.
Closes scylladb/scylladb#20276
Virtual tasks are supported by get_task_status, abort_task and
wait_task.
Task status returned by get_task_status and wait_task:
- contains task_kind to indicate whether it's virtual (cluster) or
regular (node) task;
- the children list contains, apart from task_id, the node address of each task.
After c1b2b8cb2c /task_manager/wait_task/
does not unregister tasks anymore.
Delete the check if the task was unregistered from test_task_manager_wait.
Check task status in drain_module_tasks to ensure that the task
is removed from task manager.
Fixes: #19351.
Closes scylladb/scylladb#19834
The current upgrade dtests rely on a ccm node function,
get_highest_supported_sstable_version(), that looks for
r'Feature (.*)_SSTABLE_FORMAT is enabled' in the log files.
Starting from scylla-6.0 ME_SSTABLE_FORMAT is enabled by default
and there is no cluster feature for it. Thus get_highest_supported_sstable_version()
returns an empty list, resulting in upgrade test failures.
This change introduces a separate API path that returns the highest
supported sstable format (one of la, mc, md, me) by a scylla node.
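The failure mode of the log-scanning approach can be reproduced in a
few lines. The regular expression is the one quoted above; the log
lines and helper names are made up for illustration and are not ccm's
actual implementation:

```python
import re

FEATURE_RE = re.compile(r'Feature (.*)_SSTABLE_FORMAT is enabled')
FORMATS = ["la", "mc", "md", "me"]  # oldest to newest


def formats_from_log(lines):
    """Collect sstable formats announced as cluster features in a log."""
    found = []
    for line in lines:
        m = FEATURE_RE.search(line)
        if m:
            found.append(m.group(1).lower())
    return found


def highest_format(found):
    """Newest format among those found, or None if nothing was logged."""
    return max(found, key=FORMATS.index) if found else None


# Hypothetical log excerpts.
old_log = ["INFO Feature MC_SSTABLE_FORMAT is enabled",
           "INFO Feature MD_SSTABLE_FORMAT is enabled"]
new_log = ["INFO starting up"]  # no feature line for the default format

old_result = highest_format(formats_from_log(old_log))
new_result = highest_format(formats_from_log(new_log))
```

With no matching log line the scan yields nothing, which is why a
dedicated API path is a more reliable source for this information.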
Fixes scylladb/scylladb#19772
Backports to 6.0 and 6.1 required. The current upgrade test in dtest
checks scylla upgrades up to version 5.4 only. This patch is a
prerequisite to backport the upgrade tests fix in dtest.
Closes scylladb/scylladb#19787
In CI, tests are always executed with the option --repeat=3, which generates 3 test results with the same name. The JUnit plugin in CI cannot correctly distinguish between these results. When there are two passes and one failure, the link to the test result will sometimes point to the wrong one because the test name is the same.
To fix this, a ReportPlugin is added that modifies the test case name during JUnit report generation, adding the mode and run id to the test name.
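The idea of the renaming can be sketched as follows. The exact name
format used by the plugin is not given here, so the dot-separated
layout below is an assumption for illustration only:

```python
def unique_case_name(name, mode, run_id):
    """Make a test case name unique per build mode and repeat number.

    The "name.mode.run_id" layout is hypothetical."""
    return f"{name}.{mode}.{run_id}"


# Three repeats of the same test no longer share one name.
names = [unique_case_name("test_example", "dev", run_id)
         for run_id in (1, 2, 3)]
```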
Fixes: https://github.com/scylladb/scylladb/issues/17851
Fixes: https://github.com/scylladb/scylladb/issues/15973
Task manager's tasks stay in memory after they are finished.
Moreover, even if a child task is unregistered from task manager,
it is still alive since its parent keeps a foreign pointer to it. Also,
when a task has finished successfully there is no point in keeping
all of its descendants in memory.
The patch introduces folding of task manager's tasks. Whenever
a task which has a parent is finished it is unregistered from task
manager and foreign_ptr to it (kept in its parent) is replaced
with its status. Children's statuses of the task are dropped unless
they or one of their descendants failed. So for each operation we
keep a tree of tasks which contains:
- a root task and its direct children (status if they are finished, a task
otherwise);
- running tasks and their direct children (same as above);
- a path of statuses from the root to each failed task.
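The pruning rule can be illustrated with a toy model in which a
finished task is a plain dict. This is a sketch of the rule as
described above, not the task manager's data structures; all names are
hypothetical:

```python
def has_failure(status):
    """True if this task or any of its descendants failed."""
    return (status["state"] == "failed"
            or any(has_failure(c) for c in status["children"]))


def fold(task):
    """Status kept in the parent for a finished task: children statuses
    are dropped unless they or one of their descendants failed."""
    kept = [fold(c) for c in task["children"] if has_failure(c)]
    return {"id": task["id"], "state": task["state"], "children": kept}


def node(task_id, state, children=()):
    return {"id": task_id, "state": state, "children": list(children)}


# "a" has a successful child "b" and a child "c" whose child "d" failed.
finished = node("a", "failed", [
    node("b", "done"),
    node("c", "failed", [node("d", "failed"), node("e", "done")]),
])
folded = fold(finished)
```

After folding, only the path a, c, d remains; the statuses of the
successful tasks b and e are dropped.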
/task_manager/wait_task/ does not unregister tasks anymore.
Refs: #16694.
Requires backport to 6.0 as task number exploded with tablets.
Closes scylladb/scylladb#18735
* github.com:scylladb/scylladb:
docs: describe task folding
test: rest_api: add test for task tree structure
test: rest_api: modify new_test_module
tasks: test: modify test_task methods
api: task_manager: do not unregister task in /task_manager/wait_task/
tasks: unregister tasks with parents when they are finished
tasks: fold finished tasks info their parents
tasks: make task_manager::task::impl::finish_failed noexcept
tasks: change _children type
... and replace it with boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.
The option is OFF by default, but the default scylla.yaml file sets this
to true, so that newly installed clusters turn tablets ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#18898
We make the `consistent-topology-changes` experimental feature
unused and assumed to be true in 6.0. We remove code branches that
executed if `consistent-topology-changes` was disabled.
`database::find_column_family()` throws no_such_column_family
if an unknown ks.cf is fed to it. and we call into this function
without checking for the existence of ks.cf first. since
"/storage_service/tablets/move" is a public interface, we should
translate this error to a better http error.
in this change, we check for the existence of the given ks.cf, and
throw an exception so that it can be caught by seastar::httpd::routers,
and converted to an HTTP error.
Fixes #17198
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17217
Since 6b87778 regular compaction tasks are removed from task manager
immediately after they are finished.
test_regular_compaction_task lists compaction tasks and then requests
their statuses. Only one regular compaction task is guaranteed to still
be running at that time, the rest of them may finish before their status
is requested and so it will no longer be in task manager, causing the test
to fail.
Fix statuses check to consider the possibility of a regular compaction
task being removed from task manager.
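A sketch of a status check that tolerates tasks disappearing between
listing and querying. The exception type and the `get_status` callable
are hypothetical stand-ins for whatever the test's REST client uses:

```python
class TaskNotFound(Exception):
    """Stand-in for the error returned for an already removed task."""


def collect_statuses(task_ids, get_status):
    """Query each task, skipping those already removed from task manager."""
    statuses = {}
    for task_id in task_ids:
        try:
            statuses[task_id] = get_status(task_id)
        except TaskNotFound:
            # The regular compaction task finished and was removed.
            continue
    return statuses


def fake_get_status(task_id):
    # Only "t1" is still registered in this made-up scenario.
    if task_id != "t1":
        raise TaskNotFound(task_id)
    return "running"


seen = collect_statuses(["t1", "t2", "t3"], fake_get_status)
```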
Fixes: #17776.
Closes scylladb/scylladb#17784
This change introduces logic that checks whether
tablets are enabled for any of the keyspaces
when get_ownership() is invoked.
Without it, the result would be calculated
based solely on sorted_tokens(), which was
invalid.
Refs: scylladb#17342
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Before this change, when a user tried to use the
'storage_service/ownership/{keyspace}' API with a
keyspace parameter that uses tablets, an internal
error was thrown. The code was calling a function
that is intended for vnodes: get_vnode_effective_replication_map().
This commit introduces graceful handling of such scenario and
extends the API to allow passing 'cf' parameter that denotes
table name.
Now, when keyspace uses tablets and cf parameter is not passed
a descriptive error message is returned via BAD_REQUEST.
Users cannot query ownership for keyspace that uses tablets,
but they can query ownership for a table in a given keyspace that uses tablets.
Also, new tests have been added to test/rest_api/test_storage_service.py and
to test/topology_experimental_raft/test_tablets.py in order to verify the behavior
with and without tablets enabled.
Refs: scylladb#17342
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
This change is intended to introduce tests for vnodes for
the following API paths:
- 'storage_service/ownership'
- 'storage_service/ownership/{keyspace}'
In next patches the logic that is tested will be adjusted
to work correctly when tablets are enabled. This is a safety
net that ensures that the logic is not broken.
Refs: scylladb#17342
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Fix test_storage_service.py to work with tablets.
- test_describe_ring was failing because in storage_service/describe_ring
table must be specified for keyspaces with tablets.
Do not check the status if tablets are enabled. Add checks for
specified table;
- test_storage_service_keyspace_cleanup_with_no_owned_ranges
was failing because cleanup is disabled on keyspaces with tablets.
Use the test_keyspace_vnodes fixture to get a keyspace with tablets disabled;
- test_storage_service_get_natural_endpoints required
some minor type-related fixes.
Fix test_compaction_task.py to work with tablets.
Currently the tests fail because cleanup on keyspaces with tablets is
disabled, and reshape and reshard of keyspace with tablets uses
load_and_stream which isn't covered by tasks.
Use test_keyspace_vnodes for these tests to have a keyspace with
tablets disabled.
This allows filtering the returned keyspaces by the replication they
use: tablets or vnodes.
The filter can be disabled by omitting the parameter or passing "all".
The default is "all".
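The filter semantics can be sketched as below. The accepted values
"all", "tablets" and "vnodes" follow the description above; the
function name and the input shape (a mapping from keyspace name to
whether it uses tablets) are assumptions for illustration:

```python
def filter_keyspaces(keyspaces, replication="all"):
    """Return keyspace names filtered by replication kind.

    `keyspaces` maps a keyspace name to True if it uses tablets."""
    if replication == "all":
        return sorted(keyspaces)
    if replication not in ("tablets", "vnodes"):
        raise ValueError(f"unknown replication filter: {replication}")
    want_tablets = replication == "tablets"
    return sorted(name for name, uses_tablets in keyspaces.items()
                  if uses_tablets == want_tablets)


# Hypothetical keyspaces.
known = {"ks_tablets": True, "ks_vnodes": False, "system": False}
only_tablets = filter_keyspaces(known, "tablets")
only_vnodes = filter_keyspaces(known, "vnodes")
everything = filter_keyspaces(known)
```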
Fixes: #16509
Closes scylladb/scylladb#17319
This API endpoint was failing when tablets were enabled
because of usage of get_vnode_effective_replication_map().
Moreover, it was providing an error message that was not
user-friendly.
This change extends the handler to properly service the incoming requests.
Furthermore, it introduces two new test cases that verify the behavior of
storage_service/range_to_endpoint_map API. It also adjusts the test case
of this endpoint for vnodes to succeed when tablets are enabled by default.
The new logic is as follows:
- when tablets are disabled then users may query endpoints
for a keyspace or for a given table in a keyspace
- when tablets are enabled then users have to provide
table name, because effective replication map is per-table
When a user does not provide a table name and tablets are enabled
for the given keyspace, BAD_REQUEST is returned with a
meaningful error message.
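The decision described above can be sketched as a small validation
step. The function name and the error text are hypothetical; only the
rule itself (a table name is required when the keyspace uses tablets)
comes from the description:

```python
BAD_REQUEST = 400
OK = 200


def validate_request(keyspace_uses_tablets, table=None):
    """Return (http_status, message) for a range_to_endpoint_map query."""
    if keyspace_uses_tablets and not table:
        # The effective replication map is per-table with tablets.
        return BAD_REQUEST, "table name is required for keyspaces using tablets"
    return OK, ""


vnodes_keyspace_only = validate_request(False)
tablets_without_table = validate_request(True)
tablets_with_table = validate_request(True, table="t")
```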
Fixes: scylladb#17343
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closes scylladb/scylladb#17372
per its description, "`/storage_service/describe_ring/`" returns the
token ranges of an arbitrary keyspace. actually, it returns the
first keyspace which is of non-local-vnode-based-strategy. this API
is not used by nodetool, neither is it exercised in dtest.
scylla-manager has a wrapper for this API though, but that wrapper
is not used anywhere.
in this change, this API is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17197