Waiting for all tasks does not guarantee
that the test will not spawn new tasks while we wait.
A broken manager state prevents all future put requests in case of:
1) a failure during task waiting;
2) the test continuing to create tasks in the test_after stage.
To ensure the atomicity of tests and recycle clusters without any issues, it is crucial
that all active requests in ScyllaClusterManager are completed before proceeding further.
Topology tests might spawn asynchronous tasks in parallel in ScyllaClusterManager.
A task history is introduced to be able to log and analyze all actions
against the cluster in case of failures.
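A minimal sketch of the idea, assuming a hypothetical TaskHistory helper (the class and method names are illustrative, not ScyllaClusterManager's actual API):

```python
# Illustrative task-history log: every action against the cluster is
# recorded so it can be dumped and analyzed when a test fails.
import time

class TaskHistory:
    def __init__(self):
        self.entries = []

    def record(self, action: str, detail: str = "") -> None:
        # Monotonic timestamps keep ordering meaningful across clock changes.
        self.entries.append((time.monotonic(), action, detail))

    def dump(self) -> str:
        # Render the history for inclusion in a failure report.
        return "\n".join(f"{t:.3f} {action} {detail}".rstrip()
                         for t, action, detail in self.entries)

history = TaskHistory()
history.record("server_start", "node-1")
history.record("server_stop", "node-1")
```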
The methods stop, stop_gracefully, and start in ScyllaServer
are not designed for parallel execution.
To circumvent issues arising from concurrent calls,
a start_stop_lock has been introduced.
This lock ensures that these methods are executed sequentially.
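The locking scheme can be sketched like this (an illustrative asyncio model under assumed names, not ScyllaServer's actual code):

```python
# Serializing start/stop with a per-server lock: concurrent callers queue
# up on the lock, so the methods effectively execute one at a time.
import asyncio

class Server:
    def __init__(self):
        self.start_stop_lock = asyncio.Lock()
        self.log = []

    async def start(self):
        async with self.start_stop_lock:  # concurrent callers wait here
            self.log.append("start")

    async def stop(self):
        async with self.start_stop_lock:
            self.log.append("stop")

async def main():
    s = Server()
    # Even when invoked concurrently, the operations run sequentially.
    await asyncio.gather(s.start(), s.stop(), s.start())
    return s.log

log = asyncio.run(main())
```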
This patch adds metrics that will be reported per-table per-node.
The added metrics (that are part of the per-table per-shard metrics)
are:
scylla_column_family_cache_hit_rate
scylla_column_family_read_latency
scylla_column_family_write_latency
scylla_column_family_live_disk_space
Fixes #18642
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes scylladb/scylladb#18645
incremental_reader_selector is the mechanism for incremental consumption
of disjoint sstables on range reads.
tablet_sstable_set was implemented so that the selector is efficient with
tablets.
The problem is that the selector is vnode-oriented and will only consider
a given set exhausted when the maximum token is reached.
With tablets, that means a range read on first tablet of a given shard
will also consume other tablets living in the same shard. That results
in combined reader having to work with empty sstable readers of tablets
that don't intersect with the range of the read. It won't cause extra
I/O because the underlying sstables don't intersect with the range of
the read. It's only unnecessary CPU work, as it involves creating
readers (= allocation), feeding them into combined reader, which will
in turn invoke the sstable readers only to realize they don't have any
data for that range.
With 100k tablets (ranges), and 100 tablets per shard, and ~5 sstables
per tablet, there will be this amount of readers (empty or not):
(100k * ((100^2 + 100) / 2) * avg_sstable_per_tablet=5) = ~2.5 billion.
That's ~5000 times more readers, which can be quite significant additional
CPU work, even though I/O dominates in scans. It's an inefficiency
that we'd rather get rid of.
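A quick sanity check of the arithmetic above, using the numbers as stated in the estimate:

```python
# Reader-count estimate: with the bug, a range read on a shard also creates
# readers for the remaining tablets of that shard; without it, only the
# intersecting tablet's sstables get readers.
ranges = 100_000
tablets_per_shard = 100
avg_sstables_per_tablet = 5

with_bug = ranges * ((tablets_per_shard**2 + tablets_per_shard) // 2) \
    * avg_sstables_per_tablet
without_bug = ranges * avg_sstables_per_tablet

assert with_bug == 2_525_000_000            # "~2.5 billion" readers
assert with_bug // without_bug == 5_050     # "~5000 times more readers"
```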
The behavior can be observed from logs (there's 1 sstable for each of
4 tablets, but note how readers are created for every single one of
them when reading only 1 tablet range):
```
table - make_reader_v2 - range=(-inf, {-4611686018427387905, end}]
incremental_reader_selector - create_new_readers(null): selecting on pos {minimum token, w=-1}
sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._34qn42... that has range [{-9151620220812943033, start},{-4813568684827439727, end}]
incremental_reader_selector - create_new_readers(null): selecting on pos {-4611686018427387904, w=-1}
sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._368nk2... that has range [{-4599560452460784857, start},{-78043747517466964, end}]
incremental_reader_selector - create_new_readers(null): selecting on pos {0, w=-1}
sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._38lj42... that has range [{851021166589397842, start},{3516631334339266977, end}]
incremental_reader_selector - create_new_readers(null): selecting on pos {4611686018427387904, w=-1}
sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._3dba82... that has range [{5065088566032249228, start},{9215673076482556375, end}]
```
The fix is to make sure the tablet set won't select past the
supplied range of the read.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#18556
Currently, all documentation links that feature anywhere in the help output of scylla-nodetool are hard-coded to point to the documentation of the latest stable release. As our documentation is version- and product-specific (open-source or enterprise), this is not correct. This PR addresses this by generating documentation links such that they point to the documentation appropriate for the product and version of the scylladb release.
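As a sketch of the approach (the real doc_link() lives in release.cc, in C++; the function shape and the URL layout below are placeholder assumptions, not the actual documentation site structure):

```python
# Hypothetical version/product-specific doc-link builder: the product and
# version are baked in at build time, so help output links to the matching
# documentation instead of the latest stable release.
def doc_link(path: str, product: str, version: str) -> str:
    # e.g. product="scylla" vs "scylla-enterprise"; version like "6.0".
    return f"https://docs.example.com/{product}/{version}/{path}"

link = doc_link("operating-scylla/nodetool.html", "scylla", "6.0")
```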
Fixes: https://github.com/scylladb/scylladb/issues/18276
- [x] the native nodetool is a new feature, no backport needed
Closes scylladb/scylladb#18476
* github.com:scylladb/scylladb:
tools/scylla-nodetool: make doc link version-specific
release: introduce doc_link()
build: pass scylla product to release.cc
There are two metrics to help observe base-write throttling:
* current_throttled_base_writes
* last_mv_flow_control_delay
Both show a snapshot of what is happening right at the time of querying
these metrics. This doesn't work well when one wants to investigate the
role throttling plays in occasional write timeouts. Prometheus
scrapes metrics in multi-second intervals, and the probability of that
instant catching the throttling at play is very small (almost zero).
Add two new metrics:
* throttled_base_writes_total
* mv_flow_control_delay_total
These accumulate all values, allowing Grafana to derive the values and
extract information about throttle events that happened in the past
(but not necessarily at the instant of the scrape).
Note that dividing the two values will yield the average delay per
throttle, which is also useful.
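For example, the average delay over a scrape interval can be derived from two consecutive counter samples (the numbers below are made up; only the metric names come from the text above):

```python
# Derive the average flow-control delay per throttled write between two
# Prometheus scrapes of the cumulative counters.
def avg_delay_between_scrapes(delay_total, writes_total,
                              prev_delay_total, prev_writes_total):
    writes = writes_total - prev_writes_total
    if writes == 0:
        return 0.0  # no throttled writes in this interval
    return (delay_total - prev_delay_total) / writes

# Two consecutive samples of mv_flow_control_delay_total (seconds) and
# throttled_base_writes_total:
avg = avg_delay_between_scrapes(delay_total=1.50, writes_total=30,
                                prev_delay_total=0.50, prev_writes_total=10)
```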
Closes scylladb/scylladb#18435
In commit 642f9a1966 (repair: Improve
estimated_partitions to reduce memory usage), a 10% hard-coded
estimation ratio is used.
This patch introduces a new config option to specify the estimation
ratio of partitions written by repair out of the total partitions.
It is set to 0.1 by default.
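The estimate can be sketched as follows (an illustrative function, not the actual repair code; only the 0.1 default comes from the text):

```python
# Estimate partitions written by repair as a configurable fraction of the
# total partitions, instead of the previously hard-coded 10%.
def estimated_partitions(total_partitions: int, ratio: float = 0.1) -> int:
    return int(total_partitions * ratio)

est = estimated_partitions(1_000_000)  # default 10% ratio
```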
Fixes #18615
Closes scylladb/scylladb#18634
Closes scylladb/scylladb#18616
* github.com:scylladb/scylladb:
replica: Make it explicit table's sstable set is immutable
replica: avoid reallocations in tablet_sstable_set
replica: Avoid compound set if only one sstable set is filled
There's a loop that calculates the number of shard matches over a tablet
map. The check of the given shard against optional<shard> can be made
shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#18592
As part of the unification process, alternator tests are migrated to the PythonTestSuite instead of using the RunTestSuite. The main idea is to have one suite, so it will be easier to maintain and introduce new features.
Introduce the prepare_sql option for suite.yaml to add the possibility to run CQL statements as a precondition for the test suite.
Related: https://github.com/scylladb/scylladb/issues/18188
Closes scylladb/scylladb#18442
The default limit of open file descriptors
per process may be too small for iotune on
certain machines with a large number of cores.
In such cases iotune reports failure due to
the inability to create files or to set up the
seastar framework.
This change configures the limit of open file
descriptors before running iotune to ensure
that the failure does not occur.
The limit is set via 'resource.setrlimit()' in
the parent process. The limit is then inherited
by the child process.
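A sketch of the approach, assuming a hypothetical helper around resource.setrlimit() (the required count below is an illustrative guess, not the value the test framework uses):

```python
# Raise the soft RLIMIT_NOFILE before spawning iotune; a child process
# started afterwards (e.g. via subprocess) inherits the raised limit.
import resource

def ensure_nofile_limit(required: int):
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < required:
        # The soft limit may be raised up to the hard limit without
        # privileges; an unlimited hard limit imposes no cap.
        if hard == resource.RLIM_INFINITY:
            new_soft = required
        else:
            new_soft = min(required, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)

soft, hard = ensure_nofile_limit(4096)
```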
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closes scylladb/scylladb#18546
In b4e66ddf1d (4.0) we added a new batchlog_manager configuration
named delay, but forgot to initialize it in cql_test_env. This somehow
worked, but doesn't with clang 18.
Fix it by initializing to 0 (there isn't a good reason to delay it).
Also provide a default to make it safer.
Closes scylladb/scylladb#18572
* tools/cqlsh e5f5eafd...c8158555 (11):
> cqlshlib/sslhandling: fix logic of `ssl_check_hostname`
> cqlshlib/sslhandling.py: don't use empty userkey/usercert
> Dockerfile: noninteractive isn't enough for answering yet on apt-get
> fix cqlsh version print
> cqlshlib/sslhandling: change `check_hostname` deafult to False
> Introduce new ssl configuration for disableing check_hostname
> set the hostname in ssl_options.server_hostname when SSL is used
> issue-73 Fixed a bug where username and password from the credentials file were ignored.
> issue-73 Fixed a bug where username and password from the credentials file were ignored.
> issue-73
> github actions: update `cibuildwheel==v2.16.5`
Fixes: scylladb/scylladb#18590
Closes scylladb/scylladb#18591
The code is based on a similar idea as perf_simple_query. The main differences are:
- it starts a full scylla process
- it communicates with alternator via http (localhost)
- it uses a richer table schema with all DynamoDB types instead of only strings
The testing code runs in the same process as scylla, so we can easily get various perf counters (tps, instr, allocations, etc.).
Results on my machine (with 1 vCPU):
> ./build/release/scylla perf-alternator-workloads --workdir ~/tmp --smp 1 --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload read --duration 10 2> /dev/null
...
median 23402.59616090321
median absolute deviation: 598.77
maximum: 24014.41
minimum: 19990.34
> ./build/release/scylla perf-alternator-workloads --workdir ~/tmp --smp 1 --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write --duration 10 2> /dev/null
...
median 16089.34211320635
median absolute deviation: 552.65
maximum: 16915.95
minimum: 14781.97
The above seem more realistic than the results from perf_simple_query, which are 96k and 49k tps (per core).
Related: https://github.com/scylladb/scylladb/issues/12518
Closes scylladb/scylladb#13121
* github.com:scylladb/scylladb:
test: perf: alternator: add option to skip data pre-population
perf-alternator-workloads: add operations-per-shard option
test: perf: add global secondary indexes write workload for alternator
test: perf: add option to continue after failed request
test: perf: add read modify write workload for alternator (lwt)
test: perf: add scan workload for alternator
test: perf: add end-to-end benchmark for alternator
test: perf: extract result aggregation logic to a separate struct
in 906700d5, we accepted 0 as well as the return code of
"nodetool <command> --help", because we needed to be prepared for
the newer seastar submodule while being compatible with older
seastar versions. now that, in 305f1bd3, we bumped up the seastar
submodule and picked up the change to return 0 when handling the
"--help" command line option in seastar, we are able to
drop the workaround.
so, in this change, we only use "0" as the expected return code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18627
in the same spirit of d57a82c156, this change adds `dist-unified` as one of the default targets, so that it is built by default. the unified package is required when redistributing the precompiled packages -- we publish the rpm, deb and tar balls to S3.
- [x] cmake related change, no need to backport
Closes scylladb/scylladb#18621
* github.com:scylladb/scylladb:
build: cmake: use paths to be compatible with CI
build: cmake build dist-unified by default
password_authenticator::create_default_if_missing() is a confusing mix of
coroutines and continuations, simplify it to a normal coroutine.
Closes scylladb/scylladb#18571
our CI workflow for publishing the packages expects the tar balls
to be located under `build/$buildMode/dist/tar`, where `$buildMode`
is "release" or "debug".
before this change, the CMake build system put the tar balls
under "build/dist" when the multi-config generator is used, and
`configure.py` uses a multi-config generator.
in this change, we put the tar balls for redistribution under
`build/$<CONFIG>/dist/tar`, where `$<CONFIG>` is "RelWithDebInfo"
or "Debug". this works better with the CI workflow -- we just need
to map "release" and "debug" to "RelWithDebInfo" and "Debug" respectively.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in the same spirit of d57a82c156, this change adds `dist-unified`
as one of the default targets. so that it is built by default.
the unified package is required when redistributing the precompiled
packages -- we publish the rpm, deb and tar balls to S3.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Most of the time only the main set is filled, so we can avoid one layer
of indirection (= the compound set) when the maintenance set is empty.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently empty storage_groups are allocated for tablets that are
not on this shard.
Allocate storage groups dynamically, i.e.:
- on table creation allocate only storage groups that are on this
shard;
- allocate a storage group for tablet that is moved to this shard;
- deallocate storage group for tablet that is cleaned up.
Stop compaction group before it's deallocated.
Add a flag to table::cleanup_tablet deciding whether to deallocate
sgs and use it in commitlog tests.
During compaction_group::cleanup the sstables set is updated, but
row_cache::_underlying still keeps a shared ptr to the old set.
Due to that, descriptors of deleted sstables aren't closed.
Refresh snapshot in order to store new sstables set in _underlying
mutation source.
Add an rwlock which prevents storage groups from being added/deleted
while some other layer iterates over them (or their compaction
groups).
Add methods to iterate over storage groups with the lock held.
In the following patches, storage groups (and so also sstables sets)
will be allocated only for tablets that are located on this shard.
Some layers may try to read non-existing sstable sets.
Handle this case as if the sstables set was empty instead of calling
on_internal_error.
If allow_write_both_read_old tablet transition stage fails, move
to cleanup_target stage before reverting migration.
It's a preparation for further patches which deallocate storage
group of a tablet during cleanup.
Pass compaction group id to
shard_reshaping_compaction_task_impl::reshape_compaction_group.
Modify table::as_table_state to return table_state of the given
compaction group.
When a compaction strategy uses garbage collected sstables to track
expired tombstones, do not use complete partition estimates for them,
instead, use a fraction of it based on the droppable tombstone ratio
estimate.
Fixes #18283
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#18465
PR #17771 introduced a threshold for the total memory used by all bloom filters across SSTables. When the total usage surpasses the threshold, the largest bloom filter will be removed from memory, bringing the total usage back under the threshold. This PR adds support for reloading such reclaimed bloom filters back into memory when memory becomes available (i.e., within the 10% of available memory earmarked for the reclaimable components).
The SSTables manager now maintains a list of all SSTables whose bloom filter was removed from memory and attempts to reload them when an SSTable, whose bloom filter is still in memory, gets deleted. The manager reloads from the smallest to the largest bloom filter to maximize the number of filters being reloaded into memory.
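The reload policy can be modeled like this (an illustrative sketch of the smallest-first ordering described above, not the sstables manager's actual code):

```python
# Reload reclaimed bloom filters smallest-first until the memory budget is
# exhausted, maximizing the number of filters brought back into memory.
def reload_reclaimed(reclaimed_sizes, available_memory):
    reloaded = []
    for size in sorted(reclaimed_sizes):
        if size > available_memory:
            break  # every remaining filter is at least as large
        available_memory -= size
        reloaded.append(size)
    return reloaded

# e.g. 50 units free: the two smallest filters fit, the 40-unit one does not.
reloaded = reload_reclaimed([40, 10, 30], available_memory=50)
```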
Closes scylladb/scylladb#18186
* github.com:scylladb/scylladb:
sstable_datafile_test: add testcase to test reclaim during reload
sstable_datafile_test: add test to verify auto reload of reclaimed components
sstables_manager: reload previously reclaimed components when memory is available
sstables_manager: start a fiber to reload components
sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables
sstable_datafile_test: add test to verify reclaimed components reload
sstables: support reloading reclaimed components
sstables_manager: add new intrusive set to track the reclaimed sstables
sstable: add link and comparator class to support new instrusive set
sstable: renamed intrusive list link type
sstable: track memory reclaimed from components per sstable
sstable: rename local variable in sstable::total_reclaimable_memory_size
* seastar b73e5e7d...42f15a5f (27):
> prometheus: revert the condition for enabling aggregation
> tests/unit: add a unit test for json2code
> seastar-json2code: fix the path param handling
> github/workflow: do not override <clang++,23,release>
> github/workflow: add a github workflow for running tests
> prometheus: support disabling aggregation at query time
> apps/httpd: free allocated http_server_control
> rpc: cast rpc::tuple to std::tuple when passing it to std::apply
> stall-analyser: move `args` into main()
> stall-analyser: move print_command_line_options() out of Graph
> stall-analyser: pass branch_threshold via parameter
> stall-analyser: move process_graph() into Graph class
> scripts: addr2line: cache the results of resolve_address()
> stall-analyser: document the parser of log lines
> stall-analyser: move resolver into main()
> stall-analyser: extract get_command_line_parser() out
> stall-analyser: move graph into main()
> stall-analyser: extract main() out
> stall-analyser: extract print_command_line_options() out
> stall-analyser: add more typing annotatins
> stall-analyser: surround top-level function with two empty lines
> core/app_template: return status code 0 for --help
> iotune: Print file alignments too
> seastar-json2code: extract Parameter class
> seastar-json2code: use f-string when appropriate
> seastar-json2code: use nickname in place of oper['nickname']
> seastar-json2code: use dict.get() when checking allowMultiple
Closes scylladb/scylladb#18598
The code in `global_token_metadata_barrier` allows drain to fail.
Then, it relies on fencing. However, we don't send the barrier
command to a decommissioning node, which may still receive requests.
The node may accept a write with a stale topology version. It makes
fencing ineffective.
Fix this issue by sending the barrier command to a decommissioning
node.
The raft-based topology is moved out of experimental in 6.0, no need
to backport the patch.
Fixes scylladb/scylladb#17108
Closes scylladb/scylladb#18599
Currently, if any shard repair task fails, the
`tablet_repair_task_impl` per-shard loop
breaks, since it doesn't handle the exception.
Although repair does return an error, which
is as expected, we change it, like vnode-based
repair, to make a best effort and try to repair
as much as it can, even if any of the ranges
fail.
This causes the `test_repair_with_down_nodes_2b`
dtest to fail with tablets, as seen in, e.g.
https://jenkins.scylladb.com/view/master/job/scylla-master/job/tablets/job/gating-dtest-release-with-tablets/52/testReport/repair_additional_test/TestRepairAdditional/FullDtest___full_split002___test_repair_with_down_nodes_2b/
```
AssertionError: assert 1765 == 2000
```
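The best-effort loop can be sketched as follows (illustrative Python, not the actual repair code; `flaky` is a made-up stand-in for a per-range repair that hits a failure):

```python
# Keep repairing the remaining ranges even if some fail, then report the
# collected errors instead of aborting on the first exception.
def repair_ranges(ranges, repair_one):
    errors = []
    for r in ranges:
        try:
            repair_one(r)
        except Exception as e:  # a failed range no longer aborts the loop
            errors.append((r, e))
    return errors

def flaky(r):
    if r == 2:
        raise RuntimeError("node down")

errors = repair_ranges([1, 2, 3], flaky)
```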
- [x] ** Backport reason (please explain below if this patch should be backported or not) **
Tablet repair code will be introduced in 6.0, no need to backport to earlier versions.
Closes scylladb/scylladb#18518
* github.com:scylladb/scylladb:
repair: tablet_repair_task_impl: modernize table lookup
repair: tablet_repair: make best effort in spite of errors
Due to scylladb/seastar#2231, creating a scheduling group and a
scheduling group key is not safe to do in parallel. The service level
code may attempt to create scheduling groups while
the cql_transport::cql_sg_stats scheduling group key is being created.
Until the seastar issue is fixed, move initialization of the cql sg
states before service level initialization.
Refs: scylladb/seastar#2231
Closes scylladb/scylladb#18581
When a tablet is migrated away, any inactive read which might be reading from said tablet has to be dropped. Otherwise these inactive reads can prevent sstables from being removed, and these sstables can potentially survive until the tablet is migrated back and resurrect data.
This series introduces the fix as well as a reproducer test.
Fixes: https://github.com/scylladb/scylladb/issues/18110
Closes scylladb/scylladb#18179
* github.com:scylladb/scylladb:
test: add test for cleaning up cached querier on tablet migration
querier: allow injecting cache entry ttl by error injector
replica/table: cleanup_tablet(): clear inactive reads for the tablet
replica/database: introduce clear_inactive_reads_for_tablet()
replica/database: introduce foreach_reader_concurrency_semaphore
reader_concurrency_semaphore: add range param to evict_inactive_reads_for_table()
reader_concurrency_semaphore: allow storing a range with the inactive reader
reader_concurrency_semaphore: avoid detach() in inactive_read_handle::abandon()